# Production-Ready Reinforcement Learning

It's time to bridge the gap between theoretical knowledge and real-world application. This notebook
introduces the essential techniques and tools for making RL agents production-ready.

### What We'll Cover:

1.  **Code Structure & Refactoring**: How to structure code to be modular and reusable.
2.  **Metrics & Monitoring**: Using TensorBoard to track losses, rewards, and gradients.
3.  **Checkpointing**: Saving and loading models to resume training or for inference.
4.  **Debugging & Troubleshooting**: Common techniques for when your RL agent isn't learning.
5.  **The Importance of Multiple Seeds**: Ensuring your results are robust and reproducible.
6.  **Scaling with Parallelization (Ray)**: Speeding up training with parallel environments.
7.  **Automated Hyperparameter Tuning (Optuna)**: Finding the best hyperparameters automatically.
8.  **Other Production Techniques**: A look at MLOps, CI/CD, and Safe RL.

Let's dive in!


### Imports and Setup

First, let's import the necessary libraries. We'll be using PyTorch, Gymnasium, and several new
libraries for our production techniques.


In [None]:
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
from torch.utils.tensorboard import SummaryWriter

import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import random
import os
import ray
import optuna
import copy

from util.gymnastics import gym_simulation, init_random

In [None]:
RUNS_DIR = "runs"
CHECKPOINTS_DIR = "checkpoints"

# Set up a directory for our logs and models
os.makedirs(RUNS_DIR, exist_ok=True)
os.makedirs(CHECKPOINTS_DIR, exist_ok=True)

# Init random seed of environment.
init_random()

### A Note on Environment Choice: CartPole

`CartPole` provides a **dense reward** (+1 for every timestep the pole is balanced), which allows
our agent to learn quickly. This rapid feedback loop makes it much easier to see the effects of our
tooling and techniques without waiting a long time for the agent to solve a complex exploration
problem.


In [None]:
ENV_NAME = "CartPole-v1"
TEMP_ENV = gym.make(ENV_NAME)
STATE_DIM = TEMP_ENV.observation_space.shape[0]
ACTION_DIM = TEMP_ENV.action_space.n
TEMP_ENV.close()

In [None]:
gym_simulation(ENV_NAME)

---


### Part 1: Modular Code Structure - An A2C Trainer

A key principle of production-ready code is **modularity**. Instead of rewriting the training loop
in every section, we will encapsulate the logic into a reusable `A2C_Trainer` class. This makes the
code cleaner, easier to maintain, and less prone to errors.

Our trainer will handle:

- The Actor-Critic model and optimizer.
- A single step of the A2C algorithm.
- Logic for running a full episode.


In [None]:
class ActorCritic(nn.Module):
    """The Actor-Critic neural network model."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        # TODO: Create a Sequential module for the actor: linear, relu, linear, softmax
        self.actor = None
        # TODO: Create a Sequential module for the critic: linear, relu, linear
        self.critic = None

    def forward(self, state):
        # TODO: Pass the state through the actor to get the policy
        policy = None
        # TODO: Pass the state through the critic to get the value
        value = None
        return policy, value

    @torch.no_grad()
    def act(self, state):
        # TODO: Implement the act method for inference
        # 1. Convert state to a tensor and add a batch dimension
        # 2. Get the policy from the model
        # 3. Return the action with the highest probability (use torch.argmax)
        return 0  # Placeholder


class A2C_Trainer:
    """A trainer class to encapsulate the A2C training logic."""

    def __init__(self, state_dim, action_dim, lr=0.001, gamma=0.99, hidden_dim=128):
        self.model = ActorCritic(state_dim, action_dim, hidden_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.gamma = gamma

    def train_step(self, log_probs, values, rewards):
        """Performs a single training update."""
        # TODO: Calculate discounted returns (R). Hint: Iterate backwards through rewards.
        returns = []

        # TODO: Convert returns and values to tensors. Squeeze values.
        returns = None
        values = None

        # TODO: Calculate advantages. Detach values to stop gradients.
        advantages = None

        # TODO: Calculate actor loss.
        actor_loss = None

        # TODO: Calculate critic loss using Mean Squared Error.
        critic_loss = None

        # TODO: Calculate the total loss.
        loss = None

        # TODO: Perform backpropagation.
        # 1. Zero gradients
        # 2. Backward pass
        # 3. Clip gradients
        # 4. Optimizer step

        return 0.0  # Placeholder for loss.item()

    def run_episode(self, env, seed: int = None):
        """Runs a single episode and collects trajectories."""
        if seed is None:
            seed = random.randrange(100_000)

        state, _ = env.reset(seed=seed)
        done = False
        total_reward = 0
        log_probs, values, rewards = [], [], []

        while not done:
            # TODO: Get policy and value from the model.
            policy, value = None, None

            # TODO: Create a Categorical distribution and sample an action.
            dist = None
            action = None

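            # TODO: Step the environment with the sampled action.
            # An episode ends on termination (pole fell) or truncation (time limit reached).
            next_state, reward, terminated, truncated, _ = env.step(0)  # Placeholder
            done = terminated or truncated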

            # TODO: Store the log probability of the action, the value, and the reward.
            # ...

            state = next_state
            total_reward += reward

        return log_probs, values, rewards, total_reward

---


### Part 2: Metrics and Monitoring with TensorBoard

**TensorBoard** is a powerful visualization toolkit for understanding, debugging, and optimizing
machine learning experiments. It allows us to log and visualize key metrics in real-time.

We will track:

- **Rewards**: The most important metric. Is our agent learning?
- **Losses**: How well our model is learning to predict values and update its policy.
- **Gradients**: The magnitude of our gradients. This helps diagnose issues like _vanishing_ or
  _exploding_ gradients.
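
Raw episode rewards are noisy, so a moving average is often easier to read than the per-episode curve. The sketch below is a minimal, assumption-laden example: the `MovingAverage` class and the tag name `"Reward/moving_avg"` are hypothetical and chosen only for illustration. It uses `deque(maxlen=...)` so that old episodes fall out of the window automatically.

```python
from collections import deque


class MovingAverage:
    """Tracks the mean of the most recent `window` values."""

    def __init__(self, window=100):
        self.values = deque(maxlen=window)  # old entries are dropped automatically

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


tracker = MovingAverage(window=3)
for episode, reward in enumerate([10.0, 20.0, 30.0, 40.0]):
    smoothed = tracker.update(reward)
    # In a training loop, this is where a call such as
    # writer.add_scalar("Reward/moving_avg", smoothed, episode) could go.
    print(episode, smoothed)
```

Until the window is full, the average covers only the episodes seen so far, which makes early values less reliable.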


In [None]:
def train_with_monitoring(seed, n_episodes=1000, lr=0.001, gamma=0.99):
    env = gym.make(ENV_NAME)
    # TODO: Create the trainer.
    # Hint: call torch.manual_seed(seed) first so weight initialization depends on the seed.
    trainer = None

    # Set up directories and the TensorBoard writer
    run_name = f"a2c_seed_{seed}"
    os.makedirs(os.path.join(CHECKPOINTS_DIR, run_name), exist_ok=True)
    # TODO: Create the SummaryWriter in RUNS_DIR/run_name.
    writer = None

    episode_rewards = []
    best_avg_reward = -np.inf

    for episode in range(n_episodes):
        # TODO: Run an episode with the trainer.
        log_probs, values, rewards, total_reward = None, None, None, None
        # TODO: Perform one training step.
        loss = None

        episode_rewards.append(total_reward)

        # TODO: Log the total loss and episode reward to TensorBoard.
        # Hint: use writer.add_scalar("Tag/Name", value, global_step)

        for name, param in trainer.model.named_parameters():
            if param.grad is not None:
                # TODO: Log the gradients for each parameter.
                # Hint: use writer.add_histogram(f"Gradients/{name}", param.grad, episode)
                pass

        # Save the best model based on a moving average of rewards
        if len(episode_rewards) >= 100:
            # TODO: Compute the average reward as the mean of the last 100 episodes.
            avg_reward = None
            # TODO: Log the moving_avg_reward.
            # ..
            if avg_reward > best_avg_reward:
                best_avg_reward = avg_reward
                best_ckpt_path = os.path.join(CHECKPOINTS_DIR, run_name, "model_best.pth")
                # TODO: Save the model's state dictionary.
                # Hint: use torch.save(trainer.model.state_dict(), path)

    print(f"Finished training for seed {seed}.")
    writer.close()
    env.close()
    return episode_rewards


# Run the training for one seed
rewards_seed_1 = train_with_monitoring(seed=42)

# To view the logs, run this command in your terminal:
# tensorboard --logdir=runs

![TensorBoard](assets/12_PROD_tensorboard.png) <br><small>TensorBoard UI. This is what you should
see when multiple seeds are logged.</small>


---


### Part 3: Checkpointing (Saving & Loading Models)

**Checkpointing** is the process of saving the state of your model during training. This is crucial
for several reasons:

- **Resuming Training**: If your training process is interrupted, you can resume from the last saved
  checkpoint instead of starting over.
- **Inference**: Once you have a trained model, you need to save it to use it later for making
  predictions.
- **Best Model**: You can save the model that achieved the best performance, which might not be the
  one from the very last episode.

Our `train_with_monitoring` function already saves the best-performing model. Now, let's see how to
load it and watch our trained agent perform.
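
Loading weights is enough for inference, but resuming an interrupted run usually also needs the optimizer state and the current episode index. The sketch below shows one way this could look. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the tiny `nn.Linear` stand-in model are assumptions for illustration; the notebook itself only saves the model's `state_dict`.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim


def save_checkpoint(path, model, optimizer, episode):
    """Stores everything needed to resume training, not just the weights."""
    torch.save(
        {
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "episode": episode,
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restores model and optimizer in place and returns the saved episode index."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["episode"]


# Tiny stand-in network; the notebook's ActorCritic would be used in practice.
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "checkpoint.pth")
    save_checkpoint(ckpt_path, model, optimizer, episode=42)

    restored_model = nn.Linear(4, 2)
    restored_optimizer = optim.Adam(restored_model.parameters(), lr=1e-3)
    start_episode = load_checkpoint(ckpt_path, restored_model, restored_optimizer)

print(start_episode)
```

Training would then continue from `start_episode` rather than from zero.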


In [None]:
# Path to the best model saved from our previous run
best_model_path = f"{CHECKPOINTS_DIR}/a2c_seed_42/model_best.pth"
# Instantiate the model directly
agent_model = ActorCritic(STATE_DIM, ACTION_DIM)
# TODO: Load the saved weights from best_model_path.
# Hint: use agent_model.load_state_dict(torch.load(path))

# TODO: Set the model to evaluation mode. This is CRITICAL for inference.
# Hint: use agent_model.eval()

# Pass the model (which now has an .act method) to your simulation
gym_simulation(ENV_NAME, agent_model)

---


### Part 4: Debugging and Troubleshooting in RL

Debugging RL algorithms is notoriously difficult because the agent's behavior, the environment, and
the learning algorithm are all tightly coupled. A bug might not cause a crash; instead, it might
just lead to the agent not learning. Here are some common problems and techniques to debug them.

#### Problem 1: The agent is not learning (flat reward curve).

- **Check Your Environment**: Is the observation space correct? Is the `done` signal being triggered
  appropriately?
- **Learning Rate is Too High/Low**: A very high learning rate can cause the policy to become
  chaotic, while a very low one can lead to painfully slow learning. This is where hyperparameter
  tuning (Part 7) becomes essential.
- **Bug in Return Calculation**: Double-check how rewards are turned into learning targets. In our
  A2C implementation, this is the calculation of `returns` and `advantages`. An off-by-one error or
  a mistake in applying `gamma` can kill learning.
- **Check Action Distribution**: Your policy should not collapse to a single action too quickly. Log
  the entropy of your policy distribution. If entropy drops to zero, the agent has stopped exploring
  and may be stuck in a suboptimal policy.

```python
  # Inside the training loop, after creating the distribution:
  entropy = dist.entropy().item()
  writer.add_scalar('Policy/entropy', entropy, episode)
```
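
To interpret the logged entropy, it helps to know its range. For a discrete policy over `n` actions, entropy (in nats) runs from 0 for a deterministic policy up to `ln(n)` for a uniform one, so CartPole's two actions give a maximum of about 0.693. The plain-Python helper below, `categorical_entropy`, is a hypothetical illustration of that range and not a replacement for `dist.entropy()`.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; 0 * log(0) is treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


uniform = categorical_entropy([0.5, 0.5])      # maximum for two actions: ln(2)
confident = categorical_entropy([0.99, 0.01])  # nearly collapsed policy
collapsed = categorical_entropy([1.0, 0.0])    # fully deterministic policy

print(round(uniform, 3), round(confident, 3), collapsed)
```

An entropy curve that falls from near `ln(n)` toward zero within the first few episodes is a sign that exploration may have stopped too early.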

#### Problem 2: Training is very unstable (reward goes up and down wildly).

- **High Learning Rate**: This is a classic symptom. The policy updates are too large, causing the
  agent to overshoot good policies.
- **Small Batch Size / High Variance**: In policy gradient methods, updates can have high variance.
  In our simple A2C, each episode is one "batch". If episodes are very short, the gradient estimates
  can be noisy. You can mitigate this by accumulating gradients over several episodes before
  performing an optimizer step.
- **Gradient Clipping**: Unstable training can lead to exploding gradients. We already added
  `torch.nn.utils.clip_grad_norm_`, which is a standard technique to prevent this by capping the
  magnitude of the gradients.
- **Value Function Is Not Learning**: If the critic (`value` network) provides poor estimates of the
  state value, the `advantages` will be noisy, leading to unstable policy updates. Check the
  `critic_loss` in TensorBoard. If it's not decreasing, there might be an issue with your critic's
  architecture or learning rate.
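
To make the gradient-clipping point concrete: clipping by global norm measures the combined L2 norm of all gradients and, if it exceeds a threshold, scales every gradient by the same factor so the update keeps its direction. The sketch below is a simplified plain-Python illustration of that idea, not PyTorch's implementation. The function name `clip_by_global_norm` and the list-of-lists gradient format are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scales all gradients by one factor so their combined L2 norm is at most max_norm.

    `grads` is a list of lists of floats, one inner list per parameter.
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Logging the norm measured before clipping is a cheap way to see how often clipping is actually active.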

#### General Debugging Tips:

- **Start Simple**: Always start with the simplest possible environment (like `CartPole`) and a
  known, stable algorithm before moving to more complex problems.
- **Sanity Check Model Outputs**: Before training, pass a dummy state through your model and check
  the shapes and value ranges of the output policy and value. The policy should be a valid
  probability distribution.
- **Read the Paper**: If you are implementing an algorithm from a paper, read it carefully. Small
  implementation details can make a huge difference.


---


### Part 5: The Importance of Multiple Seeds

RL algorithms can be very sensitive to the random seed, which affects weight initialization and
environment randomness. A great result on a single seed might just be luck. To get a reliable
estimate of an agent's performance, you **must** run experiments with multiple seeds and analyze the
aggregated results (mean and standard deviation).
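
Besides plotting the full curves, it is common to report a single summary number per experiment. The sketch below averages the last few episodes of each seed and then aggregates across seeds. The function name `summarize_across_seeds` and the toy reward values are made up for illustration. Note that `statistics.stdev` is the sample standard deviation, whereas `np.std` defaults to the population version.

```python
from statistics import mean, stdev


def summarize_across_seeds(rewards_per_seed, last_n=100):
    """Averages the last `last_n` episodes of each seed, then aggregates across seeds."""
    final_scores = [mean(rewards[-last_n:]) for rewards in rewards_per_seed]
    # The sample standard deviation needs at least two seeds.
    spread = stdev(final_scores) if len(final_scores) > 1 else 0.0
    return mean(final_scores), spread


# Made-up reward curves for three seeds, purely for illustration.
toy_rewards = [
    [10, 20, 30, 40],
    [10, 30, 50, 70],
    [10, 10, 20, 20],
]
avg, spread = summarize_across_seeds(toy_rewards, last_n=2)
print(f"final reward: {avg:.1f} +/- {spread:.1f} over {len(toy_rewards)} seeds")
```

With only three seeds the spread is a rough estimate, so it is best read as an indication of sensitivity rather than a precise error bar.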


In [None]:
def plot_rewards(rewards_list):
    """Plots the mean and standard deviation of rewards from multiple seeds."""
    plt.figure(figsize=(12, 6))

    # Stack the per-seed reward lists into an array of shape (n_seeds, n_episodes)
    rewards_array = np.array(rewards_list)
    mean_rewards = np.mean(rewards_array, axis=0)
    std_rewards = np.std(rewards_array, axis=0)

    plt.plot(mean_rewards, label="Mean Reward", color="blue")
    plt.fill_between(
        range(len(mean_rewards)),
        mean_rewards - std_rewards,
        mean_rewards + std_rewards,
        color="blue",
        alpha=0.2,
        label="Standard Deviation",
    )

    plt.title(f"A2C Performance on CartPole-v1 ({len(rewards_list)} Seeds)")
    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.legend()
    plt.grid(True)
    plt.show()


# Run training with multiple seeds
seeds = [42, 123, 789]
all_rewards = []
for seed in seeds:
    # Using a shorter run for the multi-seed demonstration
    rewards = train_with_monitoring(seed=seed, n_episodes=500)
    all_rewards.append(rewards)

# Plot the results
plot_rewards(all_rewards)

---


### Part 6: Scaling with Parallelization (Ray)

#### How does Ray work?

**Ray** is a framework for distributed computing that makes it simple to scale Python applications
across multiple cores or machines. Its core philosophy is to turn regular Python functions and
classes into distributable, asynchronous tasks.

We will use two fundamental primitives from Ray Core:

1.  `@ray.remote`: A decorator that turns a Python class or function into a remote object or task
    that can be executed on a separate worker process.
2.  `ray.put()` and `ray.get()`: These functions are used to efficiently transfer objects (like our
    model's weights) to Ray's distributed object store and retrieve results from our remote workers.

Our strategy will be to create several remote **`RolloutWorker`** actors. The main training loop
will send the latest model weights to these workers, who will then independently collect experience
(run episodes) in parallel. The main loop then gathers this experience to perform a single, larger
update to the model.
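
The broadcast, collect, update loop described above can be previewed without Ray. The sketch below is an analogy built on the standard library's `concurrent.futures`, not Ray itself: `executor.submit(...)` plays the role of `worker.run_episode.remote(...)`, and `future.result()` plays the role of `ray.get(...)`. The `ToyWorker` class, its fake episode, and the placeholder weight update are all hypothetical stand-ins.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class ToyWorker:
    """Stand-in for a rollout worker: owns its own RNG and returns a fake trajectory."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def run_episode(self, weights):
        # A real worker would load `weights` into a model and step an environment.
        length = self.rng.randint(5, 10)
        return [weights["bias"] + 1.0 for _ in range(length)]  # per-step rewards


workers = [ToyWorker(seed) for seed in range(4)]
weights = {"bias": 0.0}

with ThreadPoolExecutor(max_workers=len(workers)) as executor:
    for batch_idx in range(3):
        # 1. Broadcast the current weights and launch all workers at once.
        futures = [executor.submit(w.run_episode, weights) for w in workers]
        # 2. Block until every worker has returned its trajectory.
        trajectories = [f.result() for f in futures]
        # 3. One central update from the combined experience (a placeholder here).
        avg_reward = sum(sum(t) for t in trajectories) / len(trajectories)
        weights = {"bias": weights["bias"] + 0.1}
        print(batch_idx, len(trajectories), avg_reward)
```

The structure is synchronous: each batch waits for the slowest worker before the update happens, which is also how the Ray exercise in this part is laid out.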


In [None]:
@ray.remote
class RolloutWorker:
    """A remote actor for collecting experience in parallel."""

    def __init__(self, env_name, seed):
        self.env = gym.make(env_name)
        self.seed = seed

    def run_episode(self, model_weights):
        # This worker needs its own model instance
        state_dim = self.env.observation_space.shape[0]
        action_dim = self.env.action_space.n
        model = ActorCritic(state_dim, action_dim)
        model.load_state_dict(model_weights)

        states, actions, rewards = [], [], []
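        # Seed only the first reset; re-seeding every episode would replay the same start state.
        state, _ = self.env.reset(seed=self.seed)
        self.seed = None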
        done = False

        # Collect a full trajectory
        while not done:
            # TODO: Get an action from the model (no gradients needed for workers).
            action = None

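            # TODO: Step the environment with the action.
            # An episode ends on termination (pole fell) or truncation (time limit reached).
            next_state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated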

            # TODO: Store the state, action, and reward.

            state = next_state

        return states, actions, rewards

    def close(self):
        """Closes the worker's environment."""
        # TODO: Close the environment.
        pass


def train_a2c_with_ray(n_batches=250, n_workers=4, lr=0.001, gamma=0.99):
    trainer = A2C_Trainer(STATE_DIM, ACTION_DIM, lr=lr, gamma=gamma)
    writer = SummaryWriter(f"{RUNS_DIR}/a2c_ray")

    # TODO: Create remote workers using the RolloutWorker class.
    # Hint: [RolloutWorker.remote(...) for ...]
    workers = []

    for batch_idx in range(n_batches):
        # TODO: Put the model's state_dict into Ray's object store.
        # Hint: ray.put(...)
        model_weights_id = None

        # TODO: Call run_episode remotely on all workers.
        # Hint: [w.run_episode.remote(...) for w in workers]
        futures = []

        # TODO: Get the results from the remote calls.
        # Hint: ray.get(...)
        results = []

        batch_reward = 0
        trainer.optimizer.zero_grad()

        for states, actions, rewards in results:
            batch_reward += sum(rewards)

            # TODO: This inner loop calculates the loss for one trajectory
            # and accumulates its gradients in the central model.
            # The logic is very similar to A2C_Trainer.train_step, but you'll
            # need to re-evaluate the policy and values on the central model
            # to get tensors with a grad_fn.

            # 1. Convert data to tensors
            # 2. Re-evaluate policy and values on trainer.model
            # 3. Get log_probs
            # 4. Calculate returns and advantages
            # 5. Calculate actor and critic loss
            # 6. Calculate total loss and divide it by n_workers so gradients are averaged
            # 7. Call loss.backward() to accumulate gradients
            pass

        # TODO: After all gradients are accumulated, take one optimizer step.
        # ...

        avg_reward = batch_reward / n_workers
        writer.add_scalar("Reward/avg_episode_reward", avg_reward, batch_idx * n_workers)
        if (batch_idx + 1) % 25 == 0:
            print(f"Batch {batch_idx+1}/{n_batches}, Avg Reward: {avg_reward:.2f}")

    writer.close()
    print("Closing remote worker environments...")
    ray.get([w.close.remote() for w in workers])
    print("Ray training finished.")


# It's good practice to restart Ray between runs if in an interactive environment
if ray.is_initialized():
    ray.shutdown()
ray.init(ignore_reinit_error=True)

# Run the parallel training
train_a2c_with_ray()

ray.shutdown()

---


### Part 7: Automated Hyperparameter Tuning (Optuna)

#### How does Optuna work?

**Optuna** is a hyperparameter optimization framework that automates the search for the best model
settings. It uses a "define-by-run" API that makes it highly flexible and Pythonic.

The core concepts we will use are:

1.  **Study**: A `study` object manages an entire optimization task. We define the goal (e.g.,
    `direction='maximize'`).
2.  **Trial**: A `trial` represents a single execution of our training process with a specific set
    of hyperparameters. Inside our objective function, we ask the `trial` object to `suggest` values
    for each hyperparameter (e.g., `trial.suggest_float('lr', ...)`).
3.  **Objective Function**: This is a function that Optuna will call repeatedly. It takes a `trial`
    object as input, runs our training, and returns a performance score (e.g., the average reward),
    which Optuna then tries to maximize or minimize.

Optuna uses intelligent sampling algorithms (like TPE, the Tree-structured Parzen Estimator) to choose which hyperparameter combinations
to try next, making it much more efficient than a simple grid search.
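
Learning rates are usually searched on a log scale, so that values around `1e-4` are as likely to be tried as values around `1e-2`. The sketch below illustrates what log-uniform sampling means, using plain random search as a baseline. It is not Optuna and does not implement TPE; `sample_log_uniform` and the made-up `toy_score` objective are assumptions for illustration only.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draws a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_score(lr):
    """Made-up objective that peaks at lr = 1e-3; stands in for a real training run."""
    return -abs(math.log10(lr) + 3.0)


rng = random.Random(0)
candidates = [sample_log_uniform(rng, 1e-4, 1e-2) for _ in range(20)]
best_lr = max(candidates, key=toy_score)

print(f"best lr from random search: {best_lr:.2e}")
```

Random search treats every trial independently, whereas a sampler that learns from earlier trials can concentrate later trials in promising regions.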


In [None]:
def objective(trial):
    """The objective function for Optuna to optimize."""
    # TODO: Suggest hyperparameters using the trial object.
    # lr: float between 1e-4 and 1e-2 (log scale)
    # gamma: float between 0.9 and 0.999 (log scale)
    # hidden_dim: categorical choice from [64, 128, 256]
    lr = 0.001
    gamma = 0.99
    hidden_dim = 128

    # Set up environment and trainer
    env = gym.make(ENV_NAME)
    trainer = A2C_Trainer(STATE_DIM, ACTION_DIM, lr, gamma, hidden_dim)

    episode_rewards = []
    # Use fewer episodes for faster tuning trials
    for episode in range(150):
        # TODO: Run episode with the trainer
        log_probs, values, rewards, total_reward = None, None, None, None
        trainer.train_step(log_probs, values, rewards)
        episode_rewards.append(total_reward)

        # TODO: Report the current performance to the trial for pruning.
        # Hint: trial.report(metric, step=episode)

        # TODO: Check if the trial should be pruned.
        # Hint: if trial.should_prune(): raise optuna.exceptions.TrialPruned()
        pass

    env.close()

    # TODO: Return the final performance metric to be optimized.
    # Hint: np.mean of the last 50 episode rewards
    return 0.0


# TODO: Create an Optuna study. Set direction to 'maximize' and use a pruner.
# Hint: use optuna.create_study(...)
study = None

# TODO: Start the optimization. Hint: use study.optimize, use n_trials=10 and timeout=120
# ...

print(f"Best trial value: {study.best_trial.value}")
print(f"Best params: {study.best_params}")

if optuna.visualization.is_available():
    fig1 = optuna.visualization.plot_optimization_history(study)
    fig1.show()
    fig2 = optuna.visualization.plot_param_importances(study)
    fig2.show()

---


### Part 8: Other Production Techniques

While we've covered the core hands-on tools, building a true production RL system involves a broader
set of MLOps (Machine Learning Operations) practices:

- **Configuration Management**: For larger projects, hard-coding hyperparameters is not scalable.
  Using configuration files (e.g., YAML or JSON) allows you to manage settings for different
  experiments cleanly.

- **Model Versioning and Deployment**: In a production environment, you need a system for versioning
  your models, tracking their performance, and deploying them to serve actions. Tools like
  **MLflow** are excellent for experiment tracking and model management, while **Kubeflow** can
  orchestrate entire ML pipelines on Kubernetes.

- **Continuous Integration/Continuous Deployment (CI/CD)**: A CI/CD pipeline automates the process
  of testing code, training models, and deploying them. For RL, this might mean a pipeline that
  automatically runs a suite of tests, triggers a new training run with multiple seeds, evaluates
  the resulting model against a baseline, and, if it's better, promotes it to production.

- **Safe Exploration**: In real-world applications like robotics or autonomous driving, a wrong
  action during exploration can be costly or dangerous. **Safe RL** is an entire subfield dedicated
  to this problem. Techniques include using a safety layer that overrides unsafe actions, or
  employing constrained optimization to ensure the agent's policy does not violate certain safety
  criteria during updates.
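
The configuration-management point above can be sketched with a dataclass that holds defaults and a loader that applies overrides from JSON. This is a minimal example under stated assumptions: the `A2CConfig` class, the `load_config` helper, and the `config.json` file name are hypothetical, and the default values simply mirror the ones used in this notebook.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class A2CConfig:
    """Hyperparameters for one experiment; defaults mirror the values used in this notebook."""

    env_name: str = "CartPole-v1"
    lr: float = 0.001
    gamma: float = 0.99
    hidden_dim: int = 128
    n_episodes: int = 1000


def load_config(json_text):
    """Builds a config from JSON, rejecting unknown keys instead of silently ignoring them."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(A2CConfig.__dataclass_fields__)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return A2CConfig(**overrides)


# In practice this text would come from a file such as a (hypothetical) config.json.
config = load_config('{"lr": 0.0003, "n_episodes": 500}')
print(json.dumps(asdict(config), indent=2))
```

Rejecting unknown keys catches typos such as `learning_rate` early, before they silently fall back to a default and waste a training run.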


---


### Conclusion & Next Steps

In this notebook, we've elevated our RL code from a simple script to a more robust,
production-oriented setup. We've seen how to:

- **Structure** code for reuse.
- **Monitor** training to gain critical insights.
- **Checkpoint** models for safety and deployment.
- **Debug** common and frustrating RL issues.
- **Validate** results by running multiple seeds.
- **Scale** data collection with parallelization.
- **Automate** the search for optimal hyperparameters.

These techniques are the building blocks for applying Reinforcement Learning to solve real-world
problems. Happy training! 🚀
