# **FRA 503: Deep Reinforcement Learning**
**HW 3 Cartpole Function Approximation**

Napat Aeimwiratchai 65340500020  
Phattarawat Kadrum 65340500074

**Learning Objectives:**

1. Understand how function approximation works and how to implement it.

2. Understand how policy-based RL works and how to implement it.

3. Understand how advanced RL algorithms balance exploration and exploitation.

4. Be able to differentiate RL algorithms based on stochastic or deterministic policies, as well as value-based, policy-based, or Actor-Critic approaches.

5. Gain insight into different reinforcement learning algorithms, including Linear Q-Learning, Deep Q-Network (DQN), the REINFORCE algorithm, and the Actor-Critic algorithm. Analyze their strengths and weaknesses.

# **Part 1 : Understanding the algorithm**

Four function-approximation-based RL algorithms are covered:
- Linear Q-Learning 
- Deep Q-Network (DQN)
- MC REINFORCE algorithm
- Actor-Critic method (A2C)

For each algorithm, we classify the approach (value-based, policy-based, or Actor-Critic),  
specify the type of policy it learns (stochastic or deterministic),  
identify the type of observation space and action space (discrete or continuous),  
and explain how it balances exploration and exploitation.


## **Linear Q-Learning** 

- Approach Type: Value-based
- Policy Type: Deterministic (ε-greedy)
- Observation Space: Discrete or low dimensional continuous 
- Action Space: Discrete
- Explore vs Exploitation:
    - uses an ε-greedy behavior policy
    - with probability ε => exploration (random action)
    - with probability 1-ε => exploitation (greedy action)
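
The ε-greedy rule above can be sketched in a few lines of plain Python. This is only an illustration: the function name `epsilon_greedy` and the `rng` argument are assumptions for this sketch, not part of the homework code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Exploration: uniform random action index
        return rng.randrange(len(q_values))
    # Exploitation: index of the largest Q-value (first index wins on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is always greedy
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

With `epsilon=1.0` every call explores, and intermediate values trade off the two behaviors.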


Linear Q-Learning (with TD) **Q-Update Rule:** $$ q_{t+1}(S_t, A_t) \leftarrow q_t(S_t, A_t) + \alpha_t \left[ R_{t+1} + \gamma \max_a q_t(S_{t+1}, a) - q_t(S_t, A_t) \right] $$ 

**Linear Function Approximation:** $$ q(s, a) \approx \mathbf{w}^\top \phi(s, a) $$ 


**Weight Update:** $$ \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha_t \, \delta_t \, \nabla_{\mathbf{w}} q(s, a) $$ 


Where the TD error is: $$ \delta_t = R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') - q(S_t, A_t) $$ and, for the linear model, $\nabla_{\mathbf{w}} q(s, a) = \phi(s, a)$.
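
As a worked illustration of the update rule, the sketch below applies one TD step to a weight vector stored as a plain list. The helper names (`q_value`, `linear_q_update`) and the per-action feature lists are assumptions made for this example. The class in Part 2 instead keeps one weight column per action.

```python
def q_value(w, phi):
    """Linear approximation q(s, a) = w^T phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def linear_q_update(w, phi_sa, reward, next_phis, alpha, gamma, terminated):
    """One semi-gradient Q-learning step.

    phi_sa    : features of the visited pair (S_t, A_t)
    next_phis : one feature list phi(S_{t+1}, a') per candidate action a'
    """
    if terminated:
        target = reward  # no bootstrapping from a terminal state
    else:
        target = reward + gamma * max(q_value(w, p) for p in next_phis)
    delta = target - q_value(w, phi_sa)  # TD error
    # For a linear model the gradient w.r.t. w is just phi(s, a)
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, phi_sa)]
    return new_w, delta


# Zero weights, reward 1: delta = 1, so w moves by alpha * phi
w, delta = linear_q_update([0.0, 0.0], [1.0, 2.0], 1.0,
                           [[1.0, 0.0], [0.0, 1.0]], 0.1, 0.9, False)
```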

## **Deep Q-Network (DQN)**

- Approach Type: Value-based
- Policy Type: Deterministic (ε-greedy)
- Observation Space: High dimensional (continuous)
- Action Space: Discrete
- Explore vs Exploitation:
    - uses ε-greedy
    - Experience Replay: decorrelates samples
    - Target Networks: stabilize Q-value update


Deep Q-Network (DQN) **Loss Function:** $$ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right] $$

**Target Q-Value:** $$ y = r + \gamma \max_{a'} Q_{\theta^-}(s', a') $$

**Gradient Update:** $$ \theta \leftarrow \theta - \alpha \nabla_\theta L(\theta) $$

Where:
- $\theta$: parameters of the Q-network  
- $\theta^-$: parameters of the target network (updated periodically)  
- $D$: experience replay buffer  
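
To make the target and loss concrete, here is a small numeric sketch in plain Python. The function names and the list-based batch layout are assumptions for illustration only. The loss shown is the squared-error form of the equation above, whereas the DQN class in Part 2 uses a Huber loss instead.

```python
def dqn_targets(rewards, next_q_target, dones, gamma):
    """y_i = r_i for terminal transitions, else r_i + gamma * max_a' Q_target(s'_i, a')."""
    return [
        r if done else r + gamma * max(q_row)
        for r, q_row, done in zip(rewards, next_q_target, dones)
    ]


def squared_td_loss(q_taken, targets):
    """Mean squared difference between Q(s, a) and the fixed targets y."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Batch of two transitions; the second one is terminal
targets = dqn_targets(
    rewards=[1.0, 1.0],
    next_q_target=[[0.5, 2.0], [3.0, 1.0]],  # target-network outputs for s'
    dones=[False, True],
    gamma=0.5,
)
loss = squared_td_loss(q_taken=[1.0, 1.0], targets=targets)
```

Here the first target is 1 + 0.5 · 2 = 2 and the second is just the reward, so the loss is ((2 − 1)² + 0) / 2 = 0.5.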


## **Monte Carlo REINFORCE**

- Approach Type: Policy-based
- Policy Type: Stochastic
- Observation Space: Discrete or Continuous 
- Action Space: Discrete or Continuous
- Explore vs Exploitation:
    - Exploration from stochastic policy (softmax or Gaussian)
    - No ε-greedy needed
    - Monte Carlo (update at episode end) => high variance
    - Does not bootstrap (no value function), slow learning but avoids bias
    


REINFORCE (Monte Carlo Policy Gradient) **Policy Gradient Update:** $$ \theta_{t+1} \leftarrow \theta_t + \alpha_t \, G_t \, \nabla_\theta \log \pi_\theta(a_t | s_t) $$

**Return from timestep $t$:** $$ G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t + k + 1} $$
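
The return $G_t$ can be computed for every timestep in a single backward pass over the episode's rewards. The sketch below assumes the rewards are stored as a list `[R_1, ..., R_T]`; the function name is chosen for this example.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()  # restore chronological order
    return returns


# Three steps with reward 1 and gamma = 0.5
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```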

## **Actor-Critic (A2C: Advantage Actor-Critic)**

- Approach Type: Actor-Critic (Hybrid)
- Policy Type: Stochastic (Actor outputs probability distribution over actions)
- Observation Space: Discrete or Continuous 
- Action Space: Discrete or Continuous

- Explore vs Exploitation:
    - Stochastic policy: Actions are sampled from probability distributions

- Critic (value function) as a baseline: Reduce variance of policy gradient updates (used to compute Advantage)
- Bootstrapping & TD learning
- A2C uses TD(0) or n-step return updates
- Update directly from gradients


**A2C (Advantage Actor-Critic)**

$$
L^{\text{A2C}}(\theta) = -\mathbb{E}_t \left[ \log \pi_\theta(a_t | s_t) \cdot \hat{A}_t \right]
$$

Where:

$$
\hat{A}_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)
$$


Critic value loss:

$$
L^{\text{VF}}(\phi) = \mathbb{E}_t \left[ \left( V_\phi(s_t) - R_t \right)^2 \right]
$$

Entropy bonus (to encourage exploration):

$$
L^{\text{ENT}}(\theta) = \mathbb{E}_t \left[ \mathcal{H} \left[ \pi_\theta(\cdot | s_t) \right] \right]
$$

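The three terms are combined into one objective to minimize:

$$
L(\theta, \phi) = L^{\text{A2C}}(\theta) + c_v \, L^{\text{VF}}(\phi) - c_e \, L^{\text{ENT}}(\theta)
$$

Where $c_v$ and $c_e$ are the coefficients for the value loss and entropy bonus respectively, and $R_t$ is the return used as the critic's regression target.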
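
The one-step advantage and the entropy term can be illustrated numerically. The sketch below is a minimal example under stated assumptions: scalar value estimates are passed in directly, the entropy is shown for a discrete (categorical) policy, and both function names are made up for this illustration.

```python
import math


def td_advantage(reward, v_s, v_next, gamma, terminated):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t), with no bootstrap at terminal states."""
    bootstrap = 0.0 if terminated else gamma * v_next
    return reward + bootstrap - v_s


def categorical_entropy(probs):
    """H[pi] = -sum_a pi(a) * log pi(a); zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


adv = td_advantage(reward=1.0, v_s=2.0, v_next=4.0, gamma=0.5, terminated=False)
uniform_entropy = categorical_entropy([0.5, 0.5])  # equals log(2)
```

A positive advantage means the action did better than the critic expected. A uniform policy has maximal entropy, which is why adding an entropy bonus discourages premature collapse to a single action.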


### **Comparison Table**
|Algorithm|Approach Type|Policy Type|Observation Space|Action Space|Exploration vs Exploitation|
|---|---|---|---|---|---|
|Linear Q|Value based|Deterministic (ε-greedy)|Discrete / low-dimensional continuous|Discrete|ε-greedy|
|DQN|Value based|Deterministic (ε-greedy)|Continuous|Discrete|ε-greedy + Replay Buffer + Target Network|
|MC REINFORCE|Policy based|Stochastic|Discrete/Continuous|Discrete/Continuous|Stochastic Policy|
|A2C|Actor Critic|Stochastic|Discrete/Continuous|Discrete/Continuous|Stochastic policy + Estimated Advantage + baseline|

# **Part 2 : Setting up `Cart-Pole` Agent**

This part includes:
1. RL Base Class
2. Replay Buffer Class
3. Algorithm folder

### 1. RL Base Class

This class includes:

- **Constructor `(__init__)`** to initialize the following parameters:

    - **Number of actions**: The total number of discrete actions available to the agent.

    - **Action range**: The minimum and maximum values defining the range of possible actions.

    - **Learning rate**: Determines how quickly the model updates based on new information.

    - **Initial epsilon**: The starting probability of taking a random action in an ε-greedy policy.

    - **Epsilon decay rate**: The rate at which epsilon decreases over time to favor exploitation over exploration.

    - **Final epsilon**: The lowest value epsilon can reach, ensuring some level of exploration remains.

    - **Discount factor**: A coefficient (γ) that determines the importance of future rewards in decision-making.

    - **Buffer size**: Maximum number of experiences the buffer can hold.

    - **Batch size**: Number of experiences to sample per batch.

- **Core Functions**
    - `scale_action()`: Maps a discrete action index in [0, n-1], where n is the number of actions, linearly onto the range [action_min, action_max].
    ```python
    scaled = self.action_range[0] + (self.action_range[1] - self.action_range[0]) * (action / (self.num_of_action - 1))
    ```

    - `decay_epsilon()`: Decays epsilon multiplicatively (an exponential schedule over episodes), never dropping below the final epsilon.
    ```python
    self.epsilon = max(self.final_epsilon, self.epsilon * self.epsilon_decay)
    ```
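
As a quick sanity check of the two formulas above, the standalone sketch below reimplements them as free functions. The class methods read these values from `self`, so the explicit arguments here are an assumption for illustration.

```python
def scale_action(action, num_of_action, action_range):
    """Map a discrete index in [0, num_of_action - 1] onto [low, high]."""
    low, high = action_range
    return low + (high - low) * (action / (num_of_action - 1))


def decay_epsilon(epsilon, epsilon_decay, final_epsilon):
    """Multiplicative decay, floored at final_epsilon."""
    return max(final_epsilon, epsilon * epsilon_decay)


# Five discrete actions spread over [-2.5, 2.5]
forces = [scale_action(a, 5, [-2.5, 2.5]) for a in range(5)]

# Three decay steps with factor 0.5 and floor 0.2: 0.5, 0.25, 0.2
eps = 1.0
schedule = []
for _ in range(3):
    eps = decay_epsilon(eps, epsilon_decay=0.5, final_epsilon=0.2)
    schedule.append(eps)
```

Because the decay is multiplicative, a factor close to 1 (for example 0.995) gives a gradual schedule, while a very small factor drives epsilon to its floor after a single call. Note that `scale_action` needs at least two actions, otherwise the denominator is zero.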



### 2. Replay Buffer Class

A class used to store the state, action, reward, next state, and termination status from each timestep of an episode, which serve as the dataset for training neural networks. This class includes:

- **Constructor `(__init__)`** to initialize the following parameters:
  
    - **memory**: FIFO buffer to store the trajectory within a certain time window.
  
    - **batch_size**: Number of data samples drawn from memory to train the neural network.

- **Core Functions**
  
    - `add()`: Adds the state, action, reward, next state, and termination status to the FIFO buffer. When the buffer is full, the oldest entry is discarded.
    ```python
    # Create a named tuple from collections module for 5 arguments (state, action, next_state, reward, and done)
    Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward', 'done'))
    ```
    - `sample()`: Sample data from memory to use in the neural network training.
    
 
  Note that some algorithms may not use all of the data mentioned above to train the neural network.
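
A minimal sketch of such a buffer, assuming a `collections.deque` with `maxlen` as the FIFO store and uniform random sampling. The method signatures are assumptions for illustration, and the actual class in the homework may differ (for example in how `sample()` batches tensors).

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition",
                        ("state", "action", "next_state", "reward", "done"))


class ReplayBuffer:
    def __init__(self, buffer_size, batch_size):
        # deque(maxlen=...) drops the oldest item automatically when full
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, next_state, reward, done))

    def sample(self):
        # Uniform sampling without replacement breaks temporal correlation
        return random.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for step in range(5):  # states 0 and 1 are evicted
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample()
```

Callers should check `len(buffer) >= batch_size` before sampling, since `random.sample` raises `ValueError` when asked for more items than are stored.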

### 3. Algorithm folder

This folder includes:

- Linear Q Learning class
- Deep Q-Network class
- REINFORCE class
- A2C class

Each class inherits from `BaseAlgorithm` in `RL_base_function.py` and includes:

- A constructor that accepts the same parameters as the class it inherits from.

- Superclass initialization (`super().__init__()`).

- An `update()` function that updates the agent's learnable parameters and advances the training step.

- A `select_action()` function that selects an action according to the current policy.

- A `learn()` function that trains the regression model or neural network.

#### Linear Q-Learning class

```python
class Linear_QN(BaseAlgorithm):
    def __init__(
            self,
            num_of_action: int = 2,
            action_range: list = [-2.5, 2.5],
            learning_rate: float = 0.01,
            initial_epsilon: float = 1.0,
            epsilon_decay: float = 1e-3,
            final_epsilon: float = 0.001,
            discount_factor: float = 0.95,
    ) -> None:     
        self.episode_durations = []

        super().__init__(
            num_of_action=num_of_action,
            action_range=action_range,
            learning_rate=learning_rate,
            initial_epsilon=initial_epsilon,
            epsilon_decay=epsilon_decay,
            final_epsilon=final_epsilon,
            discount_factor=discount_factor,
        )

    def update(
        self,
        obs,
        action: int,
        reward: float,
        next_obs,
        next_action: int,
        terminated: bool
    ):
        obs = np.array(obs)
        next_obs = np.array(next_obs)

        # Compute current Q-value for taken action
        current_q = np.dot(obs, self.w[:, action])

        # Compute TD target: reward if terminal, otherwise Bellman update
        if terminated:
            target_q = reward
        else:
            next_q_value = np.dot(next_obs, self.w)
            target_q = reward + self.discount_factor * np.max(next_q_value)

        # TD error (difference between target and current estimate)
        td_error = target_q - current_q

        # Weight update using gradient descent on TD error
        self.w[:, action] += self.lr * td_error.item() * obs

        # Log training error
        self.training_error.append(td_error.item() ** 2)
        # ====================================== #

    def select_action(self, state):
        state = np.array(state)
        # Explore with probability ε; otherwise exploit
        if np.random.rand() < self.epsilon:
            action = np.random.randint(self.num_of_action)
        else:
            action = np.argmax(np.dot(state, self.w))
        return torch.tensor([[action]], dtype=torch.int64)        # return int(action)
    
    def learn(self, env, max_steps):
        state = env.reset()     # Reset environment
        # Extract the initial observation once; it is advanced via `obs = next_obs` below
        obs = state[0]['policy'][0].cpu().numpy()
        total_reward = 0.0
        done = False
        step = 0

        while not done and step < max_steps:

            # Select action using current policy
            action_index = self.select_action(obs)

            # Scale discrete action to environment's action range
            scaled_action = self.scale_action(action_index)

            # Execute action in the environment
            next_state, reward, done, _, __ = env.step(scaled_action)

            # Extract next observation
            next_obs = next_state['policy'][0].cpu().numpy()

            # Select next action
            next_action_index = self.select_action(next_obs)

            # Update Q-values 
            self.update(obs, action_index, reward, next_obs, next_action_index, done)

            # Prepare for next iteration
            obs = next_obs
            total_reward += reward
            step += 1
            if done:
                break
        # Decay exploration rate ε
        self.decay_epsilon()
        return total_reward, step
```

#### Deep Q-Network class

```python
# Neural Network definition for DQN policy and target networks
class DQN_network(nn.Module):
    def __init__(self, n_observations, hidden_size, n_actions, dropout):
        super(DQN_network, self).__init__()

        # Define a simple feedforward neural network
        self.fc1 = nn.Linear(n_observations, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, n_actions)

    def forward(self, x):
        # Forward pass with ReLU activations and dropout
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        return self.output(x)   # Output Q-values for all actions

class DQN(BaseAlgorithm):
    def __init__(
            self,
            device = None,
            num_of_action: int = 2,
            action_range: list = [-2.5, 2.5],
            n_observations: int = 4,
            hidden_dim: int = 64,
            dropout: float = 0.5,
            learning_rate: float = 0.01,
            tau: float = 0.005,
            initial_epsilon: float = 1.0,
            epsilon_decay: float = 1e-3,
            final_epsilon: float = 0.001,
            discount_factor: float = 0.95,
            buffer_size: int = 1000,
            batch_size: int = 1,
    ) -> None:
        # Initialize main (policy) and target networks
        self.policy_net = DQN_network(n_observations, hidden_dim, num_of_action, dropout).to(device)
        self.target_net = DQN_network(n_observations, hidden_dim, num_of_action, dropout).to(device)
        self.target_net.load_state_dict(self.policy_net.state_dict())

        # Setup training components
        self.device = device
        self.steps_done = 0
        self.num_of_action = num_of_action
        self.tau = tau

        # Optimizer for updating policy network weights
        self.optimizer = optim.AdamW(self.policy_net.parameters(), lr=learning_rate, amsgrad=True)

        # Experience replay parameters
        self.episode_durations = []
        self.buffer_size = buffer_size
        self.batch_size = batch_size

        # Initialize base RL settings
        super(DQN, self).__init__(
            num_of_action=num_of_action,
            action_range=action_range,
            learning_rate=learning_rate,
            initial_epsilon=initial_epsilon,
            epsilon_decay=epsilon_decay,
            final_epsilon=final_epsilon,  
            discount_factor=discount_factor,
            buffer_size=buffer_size,
            batch_size=batch_size,
        )

    def select_action(self, state):
        # Selects an action using ε-greedy policy
        sample = random.random()
        if sample > self.epsilon:
            # probability 1-ε: choose action with highest Q-value from policy_net (exploitation)
            with torch.no_grad():
                return self.policy_net(state.to(self.device)).argmax(dim=1).view(1, 1)
        else:
            # probability ε: choose random action (exploration)
            return torch.tensor([[random.randrange(self.num_of_action)]], device=self.device, dtype=torch.long)

    def calculate_loss(self, non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch):
        # Q(s,a) from policy network
        state_action_values = self.policy_net(state_batch).gather(1, action_batch)

        # Estimate Q(s', a') for non-terminal states using target network
        if non_final_next_states.dim() == 1:
            non_final_next_states = non_final_next_states.unsqueeze(0)
    
        next_state_values = torch.zeros(self.batch_size, device=self.device)
        next_state_values[non_final_mask] = self.target_net(non_final_next_states).max(1)[0].detach()

        # Compute expected Q value
        expected_state_action_values = (next_state_values * self.discount_factor) + reward_batch.squeeze()
        expected_state_action_values = expected_state_action_values.unsqueeze(1)

        # Loss = Huber loss between expected and actual Q-values
        loss = F.smooth_l1_loss(state_action_values, expected_state_action_values)
        return loss

    def update_policy(self):
        sample = self.generate_sample(self.batch_size)
        non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch = sample
        
        # Compute loss
        loss = self.calculate_loss(non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch)

        # Backpropagation and optimization
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_value_(self.policy_net.parameters(), 100)
        self.optimizer.step()

        return loss.item()

    def update_target_networks(self):

        # Retrieve the state dictionaries (weights) of both networks
        policy_state_dict = self.policy_net.state_dict()
        target_state_dict = self.target_net.state_dict()
        policy_items = list(policy_state_dict.items())
        target_items = list(target_state_dict.items())

        updated_state_dict = {}

        for (t_key, t_param), (p_key, p_param) in zip(target_items, policy_items):
            assert t_key == p_key, "Mismatch in parameter keys between policy and target networks."
            updated_param = self.tau * p_param + (1.0 - self.tau) * t_param
            updated_state_dict[t_key] = updated_param
        
        # Load the updated weights into the target network
        self.target_net.load_state_dict(updated_state_dict)

    def learn(self, env):
    
        obs, _ = env.reset()
        obs_array = obs["policy"] if isinstance(obs, dict) else obs
        state = torch.tensor(obs_array, dtype=torch.float32).to(self.device) 

        total_reward = 0
        done = False
        timestep = 0

        while not done:
            # Action selection
            action = self.select_action(state)

            # Execute action in environment
            scaled_action = self.scale_action(action.item()) 
            next_obs, reward, terminated, truncated, _ = env.step(scaled_action)
            done = terminated or truncated

            next_obs_array = next_obs["policy"] if isinstance(next_obs, dict) else next_obs
            next_state = torch.tensor(next_obs_array, dtype=torch.float32).to(self.device)

            # Prepare data for memory
            reward = torch.tensor([reward], dtype=torch.float32).to(self.device)
            done_tensor = torch.tensor([done], dtype=torch.bool).to(self.device)

            # Store transition in replay buffer
            self.memory.add(state, action, reward, next_state, done_tensor)

            # Update state
            state = next_state
            total_reward += reward.item()

            # Train policy network on sampled minibatch
            td_loss = self.update_policy()

            # Update target network via Polyak averaging
            self.update_target_networks()

            timestep += 1
            if done:
                # Update exploration rate
                self.decay_epsilon()
        return total_reward, timestep, td_loss
```
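
The Polyak (soft) update performed by `update_target_networks()` follows $\theta^- \leftarrow \tau \theta + (1 - \tau)\theta^-$. The sketch below shows the arithmetic on plain dictionaries of floats; using scalars instead of tensors and the name `soft_update` are simplifications assumed for this example.

```python
def soft_update(target_params, policy_params, tau):
    """Blend each target parameter toward the matching policy parameter."""
    return {
        key: tau * policy_params[key] + (1.0 - tau) * target_params[key]
        for key in target_params
    }


target = {"w": 0.0}
policy = {"w": 1.0}
target = soft_update(target, policy, tau=0.1)  # w becomes 0.1
target = soft_update(target, policy, tau=0.1)  # w becomes 0.19
```

A small `tau` makes the target network trail the policy network slowly, which keeps the bootstrapped targets from shifting abruptly between updates.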

#### MC REINFORCE class

```python
# Neural network for policy π(a|s) with softmax output
class MC_REINFORCE_network(nn.Module):
    def __init__(self, n_observations, hidden_size, n_actions, dropout):
        super(MC_REINFORCE_network, self).__init__()

        # Feedforward network with dropout and softmax over discrete actions
        self.model = nn.Sequential(
            nn.Linear(n_observations, hidden_size),  # 0
            nn.ReLU(),                               # 1
            nn.Dropout(dropout),                     # 2
            nn.Linear(hidden_size, n_actions),       # 3
            nn.Softmax(dim=-1)                       # 4
        )
        # Xavier initialization for better convergence
        for m in self.model:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        # Sequential forward pass through the policy network
        return self.model(x)


class MC_REINFORCE(BaseAlgorithm):
    def __init__(
            self,
            device = None,
            num_of_action: int = 2,
            action_range: list = [-2.5, 2.5],
            n_observations: int = 4,
            hidden_dim: int = 64,
            dropout: float = 0.5,
            learning_rate: float = 0.001,
            discount_factor: float = 0.95,
    ) -> None:
        # Policy network and optimizer
        self.LR = learning_rate
        self.policy_net = MC_REINFORCE_network(n_observations, hidden_dim, num_of_action, dropout).to(device)
        self.optimizer = optim.AdamW(self.policy_net.parameters(), lr=learning_rate)

        self.device = device
        self.steps_done = 0
        self.episode_durations = []

        # Inherit base parameters
        super(MC_REINFORCE, self).__init__(
            num_of_action=num_of_action,
            action_range=action_range,
            learning_rate=learning_rate,
            discount_factor=discount_factor,
        )

    def calculate_stepwise_returns(self, rewards):
        R = 0
        returns = []

        # Compute G_t for each t in reverse
        for r in reversed(rewards):
            R = r + self.discount_factor * R
            returns.insert(0, R)

        returns = torch.tensor(returns, dtype=torch.float32).to(self.device)

        # Normalize returns to reduce variance 
        if returns.size(0) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-9)
        return returns 

    def generate_trajectory(self, env):
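        state = env.reset()
        # env.reset() may return (obs, info); keep only the observation
        if isinstance(state, tuple):
            state = state[0]
        state = state["policy"] 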
        state = torch.tensor(state, dtype=torch.float32).to(self.device)
        if state.dim() == 1:
            state = state.unsqueeze(0)

        log_prob_actions = []
        rewards = []
        trajectory = []
        done = False
        episode_return = 0.0
        timestep = 0
        
        while not done:
            # Compute action probabilities from current state
            probs = self.policy_net(state)
            m = distributions.Categorical(probs)    # Categorical distribution over actions

            action = m.sample() # Sample an action
            log_prob = m.log_prob(action)   

            # Scale the discrete action for continuous environments
            scaled_action = self.scale_action(action.item())

            # Step in environment
            next_state, reward, terminated, truncated, _ = env.step(scaled_action)
            done = terminated or truncated   # end on failure or time-limit truncation

            # Clean up different state formats
            if isinstance(next_state, tuple):
                next_state = next_state[0]
            if isinstance(next_state, dict):
                next_state = next_state["policy"] 
            next_state = torch.tensor(next_state, dtype=torch.float32).to(self.device)

            if next_state.dim() == 1:
                next_state = next_state.unsqueeze(0)
            
            # Record data
            log_prob_actions.append(log_prob)
            rewards.append(reward)
            trajectory.append((state, action, reward, next_state, done))
            
            # Prepare for next step
            state = next_state
            episode_return += reward
            timestep += 1
            if done:
                break

        # Compute returns from rewards and stack log_probs
        stepwise_returns = self.calculate_stepwise_returns(rewards)
        log_prob_actions = torch.stack(log_prob_actions)
        return episode_return, stepwise_returns, log_prob_actions, trajectory, timestep
    
    def calculate_loss(self, stepwise_returns, log_prob_actions):
        return -torch.sum(stepwise_returns * log_prob_actions)

    def update_policy(self, stepwise_returns, log_prob_actions):
        loss = self.calculate_loss(stepwise_returns, log_prob_actions)

        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), max_norm=1.0)
        self.optimizer.step()
        return loss.item()
    
    def learn(self, env):
        self.policy_net.train()
        episode_return, stepwise_returns, log_prob_actions, trajectory, ep_len = self.generate_trajectory(env)
        loss = self.update_policy(stepwise_returns, log_prob_actions)
        return episode_return, loss, trajectory, ep_len
```
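
The return normalization used in `calculate_stepwise_returns()` can be illustrated without PyTorch. This sketch assumes the sample standard deviation (divisor n − 1), which is the default behaviour of `torch.Tensor.std`. The function name is chosen for this example.

```python
import statistics


def normalize_returns(returns, eps=1e-9):
    """Shift returns to zero mean and scale to unit (sample) standard deviation."""
    if len(returns) < 2:
        return list(returns)  # std is undefined for a single value
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)
    return [(g - mean) / (std + eps) for g in returns]


normalized = normalize_returns([1.0, 2.0, 3.0])  # approximately [-1, 0, 1]
```

After normalization roughly half of the steps receive a negative weight, so the gradient pushes down the probability of actions with below-average return rather than reinforcing every action in a long episode.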

#### A2C class

```python
class Actor(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim=1, learning_rate=1e-4):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, output_dim)

        # Learnable log S.D. for continuous policy (Gaussian)
        self.log_std = nn.Parameter(torch.zeros(output_dim))  # log standard deviation
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.init_weights()

    def init_weights(self):
        # Xavier (Glorot) initialization
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        mu = self.mu_head(x)
        std = torch.exp(self.log_std)   # Convert log_std to std
        dist = Normal(mu, std)          # Gaussian policy
        return dist

# Critic Network (Value Function)
class Critic(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim=1, learning_rate=0.0001):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.value_head = nn.Linear(hidden_dim, output_dim)  # Value output
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.init_weights()

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        value = self.value_head(x)
        return value

# Actor-Critic Agent
class Actor_Critic(BaseAlgorithm):
    def __init__(self, 
                 state_dim, 
                 num_of_action,
                 learning_rate,
                 action_range: list = [-2.5, 2.5],
                 hidden_dim=128, 
                 gamma=0.99, 
                 device=None):
        super().__init__(
            num_of_action=num_of_action,
            action_range=action_range,
            learning_rate=learning_rate,
            discount_factor=gamma,
        )

        self.device = device or torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.actor = Actor(state_dim, hidden_dim, output_dim=1, learning_rate=learning_rate).to(self.device)
        self.critic = Critic(state_dim, hidden_dim, learning_rate=learning_rate).to(self.device)

        self.gamma = gamma
        self.episode_durations = []
        self.action_range = action_range
        self.num_of_action = num_of_action

    def scale_action_AC(self, action):
        action_min, action_max = self.action_range
        scaled = action_min + (action_max - action_min) * ((action + 1) / 2)  # Maps a nominal action in [-1, 1] onto the range; Gaussian samples outside it are clamped below
        return torch.clamp(scaled, action_min, action_max)

    def select_action(self, state):
        state = torch.tensor(state, dtype=torch.float32).to(self.device)
        dist = self.actor(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        scaled_action = self.scale_action_AC(action)
        return scaled_action, log_prob


    def learn(self, env):
        state_dict, _ = env.reset()
        state = state_dict['policy'].to(self.device)

        done = False
        episode_return = 0
        log_probs = []
        values = []
        rewards = []
        timestep = 0  

        while not done:
            action, log_prob = self.select_action(state)
            value = self.critic(state, action)

            # Interact with environment
            action = action.squeeze(0)
            action_env = action.unsqueeze(0) if action.ndim == 1 else action
            next_state_dict, reward, terminated, truncated, _ = env.step(action_env)
            done = terminated or truncated
            next_state = next_state_dict['policy'].to(self.device)

            # Store transition data
            log_probs.append(log_prob)
            values.append(value)
            rewards.append(torch.tensor([reward], dtype=torch.float32, device=self.device))
            
            state = next_state
            episode_return += reward
            timestep += 1
            if done:
                break

        # Monte Carlo return computation
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.cat(returns).detach()

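        # Compute Advantage (flatten to shape [T] so nothing broadcasts to T x T)
        values = torch.cat(values).reshape(-1)
        advantage = returns - values

        # Actor loss (log-probs flattened to shape [T] to match the advantage)
        log_probs = torch.stack(log_probs).reshape(-1)
        actor_loss = -(log_probs * advantage.detach()).mean()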

        # Critic loss
        critic_loss = advantage.pow(2).mean()

        # Update actor
        self.actor.optimizer.zero_grad()
        actor_loss.backward()
        self.actor.optimizer.step()

        # Update critic
        self.critic.optimizer.zero_grad()
        critic_loss.backward()
        self.critic.optimizer.step()
        
        return episode_return, actor_loss.item(), critic_loss.item(), len(rewards)
```

# **Part 3 : Training & Playing to stabilize `Cart-Pole` Agent.**

We design the experiments by varying the hyperparameters of each algorithm and training to observe the results. In each experiment, one parameter in **param** is changed at a time while all other parameters stay at their **default** values.

## **Training**

In the training part, we loop over `n_episodes` to train the agent and use TensorBoard to log the agent's data. Below is the base training code shared by all the algorithms.

```python

    # hyperparameters
    num_of_action = xx
    action_range = [-xx, xx]  
    learning_rate = xx
    hidden_dim = xx
    n_episodes = xx
    initial_epsilon = xx
    epsilon_decay = xx
    final_epsilon = xx
    discount = xx
    buffer_size = xx
    batch_size = xx

    # Setup matplotlib 
    is_ipython = 'inline' in matplotlib.get_backend()
    if is_ipython:
        from IPython import display
    plt.ion()

    # if GPU is to be used
    device = torch.device(
        "cuda" if torch.cuda.is_available() else
        "mps" if torch.backends.mps.is_available() else
        "cpu"
    )

    task_name = str(args_cli.task).split('-')[0]  # Stabilize, SwingUp
    Algorithm_name = "Linear_Q"

    # Depend on which algorithm class we use
    agent = Algo_xxx(
        device = device,
        num_of_action = num_of_action,
        action_range = action_range,
        learning_rate = learning_rate,
        hidden_dim = hidden_dim,
        initial_epsilon = initial_epsilon,
        epsilon_decay = epsilon_decay,
        final_epsilon = final_epsilon,
        discount_factor = discount,
        buffer_size = buffer_size,
        batch_size = batch_size,
    )

    # reset environment
    obs, _ = env.reset()
    timestep = 0

    # tensor board logging
    log_dir = os.path.join("runs", f"{task_name}_{Algorithm_name}_{num_of_action}_{action_range[1]}")
    writer = SummaryWriter(log_dir)

    episode_rewards = []
    episode_lengths = []

    # simulate environment
    while simulation_app.is_running():

        # run everything in inference mode
        with torch.inference_mode():
            for episode in tqdm(range(n_episodes)):

                # Agent learning 
                rewards, episode_len = agent.learn(env)

                # logging data
                writer.add_scalar("Reward/Episode", rewards, episode)
                writer.add_scalar("Policy/Epsilon", agent.epsilon, episode)
                writer.add_scalar("Episode/Length", episode_len, episode)
                episode_rewards.append(rewards.item())
                episode_lengths.append(episode_len)

                if episode % 100 == 0:

                    # logging average over 100 episodes
                    avg_rew_100 = np.mean(episode_rewards[-100:])
                    avg_ep_len_100 = np.mean(episode_lengths[-100:])
                    writer.add_scalar("Reward/Avg_Reward_100", avg_rew_100, episode)
                    writer.add_scalar("Episode/Avg_Ep_len_100", avg_ep_len_100, episode)

                    print("epsilon : ", agent.epsilon)
                    print("reward : ", avg_rew_100)

                    # Save Q-Learning agent
                    w_file = f"{Algorithm_name}_{episode}_{num_of_action}_{action_range[1]}.json"
                    full_path = os.path.join(f"w/{task_name}", Algorithm_name)
                    agent.save_w(full_path, w_file)
        
        print('Complete')
        agent.plot_durations(show_result=True)
        plt.ioff()
        plt.show()
            
        if args_cli.video:
            timestep += 1
            # Exit the play loop after recording one video
            if timestep == args_cli.video_length:
                break
        break
    # close the simulator
    env.close()
    writer.close()
```
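The training loop logs `agent.epsilon` once per episode. As a minimal sketch of what that curve may look like, assume the agent multiplies epsilon by `epsilon_decay` after every episode and never lets it fall below `final_epsilon`. This rule is an assumption, since the agent class is not shown here, and `epsilon_schedule` is a hypothetical helper. The numbers are the Linear Q-learning defaults.

```python
def epsilon_schedule(initial_epsilon, epsilon_decay, final_epsilon, n_episodes):
    """Epsilon used in each episode under multiplicative decay with a floor."""
    schedule = []
    epsilon = initial_epsilon
    for _ in range(n_episodes):
        schedule.append(epsilon)
        # Assumed rule: decay once per episode, never below final_epsilon.
        epsilon = max(final_epsilon, epsilon * epsilon_decay)
    return schedule

schedule = epsilon_schedule(1.0, 0.998, 0.001, 3000)
print(f"epsilon in the last episode: {schedule[-1]:.4f}")
```

Under this assumed rule, epsilon would still be above `final_epsilon` after 3000 episodes with these values, because 0.998^3000 is roughly 0.0025.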

## Play

The code below is the base of `play.py` for all algorithms. It loads the model weights saved by `training.py`. Over `n_episodes` episodes, it records the states and the video of the episode with the longest alive time, which we analyze and visualize later.

```python

    # hyperparameter
    num_of_action = xx
    action_range = [-xx, xx]  
    learning_rate = xx
    hidden_dim = xx
    n_episodes = xx
    initial_epsilon = xx
    epsilon_decay = xx
    final_epsilon = xx
    discount = xx
    buffer_size = xx
    batch_size = xx

    # set up matplotlib
    is_ipython = 'inline' in matplotlib.get_backend()
    if is_ipython:
        from IPython import display
    plt.ion()

    # if GPU is to be used
    device = torch.device(
        "cuda" if torch.cuda.is_available() else
        "mps" if torch.backends.mps.is_available() else
        "cpu"
    )

    # Depend on which algorithm class we use
    agent = Algo_xxx(
        device = device,
        num_of_action = num_of_action,
        action_range = action_range,
        learning_rate = learning_rate,
        hidden_dim = hidden_dim,
        initial_epsilon = initial_epsilon,
        epsilon_decay = epsilon_decay,
        final_epsilon = final_epsilon,
        discount_factor = discount,
        buffer_size = buffer_size,
        batch_size = batch_size,
    )

    task_name = str(args_cli.task).split('-')[0]  # Stabilize, SwingUp
    Algorithm_name = "Linear_Q"  # must match the name used during training

    # Select model weight for each algorithm
    q_value_file = "xxxx.json"    
    full_path = os.path.join(f"w/{task_name}", Algorithm_name)
    agent.load_w(full_path, q_value_file)

    # reset environment
    obs, _ = env.reset()
    timestep = 0
    # simulate environment
    while simulation_app.is_running():
        longest_episode_log = []
        longest_episode_frames = []
        max_steps = 0
        best_return = 0.0
        # run everything in inference mode
        with torch.inference_mode():
            for episode in range(n_episodes):
                obs, _ = env.reset()
                done = False
                total_reward = 0
                steps = 0
                episode_log = []
                episode_frames = []

                obs = obs['policy'][0].cpu().numpy()
                while not done:

                    # Agent selects action
                    action = agent.select_action(obs)

                    # Environment step
                    next_obs, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    total_reward += reward
                    steps += 1

                    # Log state
                    # next_obs is a dict; 'policy' holds the observation tensor (first env only)
                    episode_log.append(next_obs['policy'][0].cpu().numpy())
                    
                    # Update observation
                    obs = next_obs['policy'][0].cpu().numpy()

                    # Capture frame
                    if args_cli.video:
                        frame = env.render()
                        episode_frames.append(frame)

                if steps > max_steps:
                    max_steps = steps
                    longest_episode_log = episode_log
                    longest_episode_frames = episode_frames
                    best_return = total_reward

        # Save state log to CSV
        with open('Linear_Q_longest_episode_5k.csv', mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerows(longest_episode_log)

        print(f"\nLongest episode: {max_steps} steps | Return: {best_return.item():.4f}")
        print("Saved as 'Linear_Q_longest_episode_5k.csv'")

        # Save video
        if args_cli.video and longest_episode_frames:
            video_path = Path("Linear_Q_longest_episode_5k.mp4")
            imageio.mimsave(video_path, longest_episode_frames, fps=100)
            print(f"Saved video as '{video_path}'")

        break  # Exit after evaluation

    # close the simulator
    env.close()
```

## Plot state 

```python
import pandas as pd
import matplotlib.pyplot as plt
import re

csv_files= [
    'DQN_longest_episode_hi256.csv',
    'Linear_Q_longest_episode.csv',
    'MC_longest_episode.csv',
    'AC_longest_episode_ar_10_10.csv'
]
columns = ["Pos:cart", "Pos:pole", "Vel:cart", "Vel:pole"]

def extract_label(filename):
    """Extract algorithm name from filename prefix."""
    match = re.match(r'([A-Za-z_]+)_longest_episode', filename)
    if match:
        return match.group(1).replace("_", "")
    return "default"

plt.figure(figsize=(15, 10))

for i, column in enumerate(columns):
    plt.subplot(2, 2, i + 1)

    for csv_file in csv_files:
        df = pd.read_csv(csv_file, header=None, names=columns)
        label = extract_label(csv_file)
        plt.plot(df[column], label=label, linewidth=1)

    plt.title(f'{column}', fontsize=10)
    plt.xlabel('Time Step', fontsize=8)
    plt.ylabel('Value', fontsize=8)
    plt.legend(fontsize='x-small')
# Apply the layout before saving so the saved figure has no overlapping labels
plt.tight_layout()
plt.savefig("All.png", dpi=300)
```

## Linear Q-learning

```py
defaults = {
    'num_of_action': 5,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'n_episodes': 3000,
    'initial_epsilon': 1.0,
    'epsilon_decay': 0.998,
    'final_epsilon': 0.001,
    'discount': 0.9
}
```

```py
param = {
    "learning_rate": [0.005],
    "discount": [0.8],
    "action_range": [[-10, 10]],
}
```

**3000 episodes train**

<p align = "center">
    <img src="result/LQ_reward.png" alt="Alt text" width="800"/>
</p>

With 3000 episodes, the highest reward comes from the default parameters with a *`lower action range (orange)`*.

**Increase train to 5000 episodes**

<p align = "center">
    <img src="result/LQ_reward_5000ep.png" alt="Alt text" width="800"/>
</p>

With 5000 episodes, the highest reward comes from the default parameters with a *`lower action range (dark blue)`*.

From the reward graphs, we select the model trained with the changed *`action_range`*, as it gives the highest overall reward for both 3000 episodes (orange line, action_range = [-10, 10]) and 5000 episodes (dark blue line, action_range = [-10, 10]).

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/Linear_Q_longest_episode.mp4" type="video/mp4">
    </video>
    <p>action range [-10, 10], trained for 3000 episodes</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/Linear_Q_longest_episode_5k.mp4" type="video/mp4">
    </video>
    <p>action range [-10, 10], trained for 5000 episodes</p>
  </div>
</div>

<p align = "center">
    <img src="result/LQ.png" alt="Alt text" width="800"/>
</p>

This plot samples **one episode of the agent playing until termination** and **shows the state (position and velocity) of the cart and the pole**, as labeled in each title. The two curves are the models trained for 3,000 episodes (default) and 5,000 episodes (5k). With the 3,000-episode model, the cart-pole can stabilize the pole, but the episode terminates because the cart slides out of the boundary. The 5,000-episode model does not give the same result: it cannot hold the pole stable, and the episode terminates because the pole goes out of the boundary.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <img src="result/LQ_raw.png" alt="Alt text" width="800"/>
    <p>LQ raw rewards during training</p>
  </div>

  <div style="width: 48%;">
    <img src="result/LQ_avg.png" alt="Alt text" width="800"/>
    <p>LQ average rewards during training</p>
  </div>
</div>

These plots show the variance of the reward per episode. The red line (5k episodes) has higher variance than the green line (3k episodes). Although the average reward of the 5k-episode run is higher, its variance is also higher, so we conclude that **3000 episodes with action range [-10, 10] is better!**
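The comparison above weighs average reward against variance. A minimal sketch of how both could be read from the logged episode rewards is given below. `summarize_rewards` is a hypothetical helper, and the two arrays are random placeholder data, not the rewards from our runs.

```python
import numpy as np

def summarize_rewards(rewards, last_n=100):
    """Mean and standard deviation of the last `last_n` episode rewards."""
    tail = np.asarray(rewards, dtype=float)[-last_n:]
    return {"mean": float(tail.mean()), "std": float(tail.std())}

# Placeholder data only, NOT the rewards from our runs.
rng = np.random.default_rng(0)
steady_run = 5.0 + 0.5 * rng.standard_normal(3000)
noisy_run = 6.0 + 3.0 * rng.standard_normal(5000)

for name, rewards in [("steady", steady_run), ("noisy", noisy_run)]:
    stats = summarize_rewards(rewards)
    print(f"{name}: mean={stats['mean']:.2f}, std={stats['std']:.2f}")
```

A run with a slightly higher mean but a much larger standard deviation is less predictable at deployment, which is the trade-off behind our choice.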

**The best adjustment for Linear Q is a *lower action range (-10, 10)* than the default (-20, 20)**

## Deep Q-Network (DQN)

```py
defaults = {
    'num_of_action': 7,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'hidden_dim': 128,
    'n_episodes': 3000,
    'tau': 0.005,
    'dropout': 0.1,
    'initial_epsilon': 1.00,
    'epsilon_decay': 0.998,
    'final_epsilon': 0.01,
    'discount': 0.9,
    'buffer_size': 10000,
    'batch_size': 1
}
```

```py
param = {
    "learning_rate": [0.005],
    "hidden_dim": [256],
    "action_range": [[-10, 10]],
    "tau": [0.001, 0.01],
    "dropout": [0.2],
    "buffer_size": [50000],
    "discount": [0.8]
}
```

<p align = "center">
    <img src="result/DQN_reward.png" alt="Alt text" width="800"/>
</p>

Since the reward graph makes it difficult to see the **difference in average reward between the adjustments**, we ran all the models to determine which one gives the longest episode length, as shown in the graph below.

<p align = "center">
    <img src="result/DQN.png" alt="Alt text" width="800"/>
</p>

Most of the DQN adjustments can stabilize the pole, but we need the one that gives the highest reward. We therefore select the model trained with the changed *`hidden_dim`*, as it gives both the longest episode and the highest reward.


<p align="center">
  <video width="800" controls>
    <source src="result/DQN_longest_episode_hi256.mp4" type="video/mp4">
  </video>
</p>


**The best for DQN is the default parameters with a *higher hidden dimension (256)* than the default (128)**

Although this model stays alive the longest, its episode still terminates because the cart slides out of the boundary.


## Monte Carlo REINFORCE


```py
defaults = {
    'num_of_action': 7,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'hidden_dim': 128,
    'n_episodes': 3000,
    'n_observations': 4,
    'dropout': 0.1,
    'discount': 0.9
}
```

```py
param = {
    "learning_rate": [0.0005],
    "hidden_dim": [256],
    "discount": [0.8],
    "dropout": [0.2],
    "action_range": [[-10, 10]],
}
```

<p align = "center">
    <img src="result/MC_reward.png" alt="Alt text" width="800"/>
</p>

From the reward graph, we select the model trained with the changed *`discount_factor` (pink line)*, which gives the highest average reward at the end of training.

<p align="center">
  <video width="800" controls>
    <source src="result/MC_longest_episode.mp4" type="video/mp4">
  </video>
</p>

<p align = "center">
    <img src="result/MC.png" alt="Alt text" width="800"/>
</p>

**The best for MC REINFORCE is the default parameters with a *lower discount factor (0.8)* than the default (0.9)**

## Actor-Critic (A2C: Advantage Actor-Critic)


```py
defaults = {
    'num_of_action': 7,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'hidden_dim': 128,
    'n_episodes': 3000,
    'n_observations': 4,
    'discount': 0.9
}
```

```py
param = {
    "learning_rate": [0.005],
    "hidden_dim": [256],
    "discount": [0.8],
    "action_range": [[-10, 10]],
}
```

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <img src="result/AC_reward.png" alt="Alt text" width="800"/>
    <p>AC rewards during training</p>
  </div>

  <div style="width: 48%;">
    <img src="result/AC_loss.png" alt="Alt text" width="800"/>
    <p>AC actor loss during training</p>
  </div>
</div>

As seen in the left graph, the *default (pink)* and *action_range (black)* models reach the highest rewards, and their rewards are similar, so we selected both models for further evaluation.

In the right graph, however, some adjustments (such as the higher hidden dimension, orange) show a higher actor loss, which makes them prone to overfitting (their reward decreases to zero).

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_default.mp4" type="video/mp4">
    </video>
    <p>default parameter</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_ar_10_10.mp4" type="video/mp4">
    </video>
    <p>changing action_range</p>
  </div>
</div>

Both Actor-Critic adjustments **stay alive for the full episode (1000 steps, or 10 seconds), but which one is better?**

<p align = "center">
    <img src="result/blackpink.png" alt="Alt text" width="800"/>
</p>

This plot shows **how stable the rewards are across episodes**: the action range model (black) has a higher reward bound than the default (pink) and is also more stable.

<p align = "center">
    <img src="result/AC.png" alt="Alt text" width="800"/>
</p>

So we plot the position and velocity of both models. Both can stabilize the pole and keep the cart inside the boundary, but the **rate of change** of position and velocity shows that the **lower action range (orange)** is **more stable and changes more slowly** than the default (blue).
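The rate of change discussed above can also be put into numbers. The sketch below is one possible way to do it, assuming each row of the state log is one time step and that a step lasts 0.01 s (1000 steps in 10 seconds). `mean_abs_rate` is a hypothetical helper, and the trajectory is a tiny placeholder, not logged data.

```python
import numpy as np

def mean_abs_rate(states, dt=0.01):
    """Mean absolute change per second of each state column."""
    states = np.asarray(states, dtype=float)
    # Difference between consecutive time steps, averaged per column.
    return np.abs(np.diff(states, axis=0)).mean(axis=0) / dt

# Tiny placeholder trajectory with columns [cart position, pole angle].
example = [[0.0, 0.0], [0.1, -0.2], [0.2, 0.0]]
print(mean_abs_rate(example))
```

To compare two models, the same function could be applied to each saved CSV log (for example after loading it with `np.loadtxt(path, delimiter=",")`); the model with smaller values changes its state more slowly.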

**The best for AC is the default parameters with a *lower action range (-10, 10)* than the default (-20, 20)**

# **Part 4 : Evaluate `Cart-Pole` Agent performance.**

- Learning efficiency (how well the agent learns to receive higher rewards)
- Deployment performance (how well the agent performs in the stabilize problem)

Analyze and visualize the result to determine: 
1. Which algo performs best?
2. Why does it perform better than the other?

From Part 3, the *Actor-Critic (A2C: Advantage Actor-Critic)* gives the best performance.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_default.mp4" type="video/mp4">
    </video>
    <p>default parameter</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_ar_10_10.mp4" type="video/mp4">
    </video>
    <p>changing action_range</p>
  </div>
</div>

`A2C` combines both value-based and policy-based approaches:
- The actor learns the policy
- The critic learns the value

Compare this to `Linear Q Learning`, which uses a linear approximation:

$$ q(s, a) \approx \mathbf{w}^\top \phi(s, a) $$

A linear model may therefore be unable to represent the non-linear (complex) state-action space in this task, as the cart-pole dynamics are non-linear.
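To make the linearity concrete, here is a minimal sketch of a linear Q-function with the TD weight update from Part 1. It assumes one weight column per action, so that $\phi(s, a)$ is simply the state placed in the block of action $a$ and $q(s, a) = \mathbf{w}_a^\top s$. This feature choice and the `LinearQ` class are assumptions for illustration and may differ from our implementation.

```python
import numpy as np

class LinearQ:
    """Linear Q-function: q(s, a) = w[:, a] . s (assumed feature layout)."""

    def __init__(self, n_obs, n_actions, lr=0.001, discount=0.9):
        self.w = np.zeros((n_obs, n_actions))
        self.lr = lr
        self.discount = discount

    def q_values(self, state):
        return state @ self.w  # one value per action

    def update(self, state, action, reward, next_state, done):
        # TD target: R + gamma * max_a' q(S', a'), without bootstrap at episode end.
        bootstrap = 0.0 if done else self.discount * np.max(self.q_values(next_state))
        td_error = reward + bootstrap - self.q_values(state)[action]
        # The gradient of q(s, a) with respect to w[:, a] is the state itself.
        self.w[:, action] += self.lr * td_error * state
        return td_error

agent = LinearQ(n_obs=4, n_actions=5)
state = np.array([0.1, 0.0, -0.2, 0.05])
next_state = np.array([0.12, 0.01, -0.18, 0.04])
td_error = agent.update(state, action=2, reward=1.0, next_state=next_state, done=False)
print(td_error, agent.q_values(state))
```

Because every Q-value is a weighted sum of the raw state, the model can only express value functions that are linear in the cart and pole variables.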

<p align = "center">
    <img src="result/LQ_reward_ep.png" alt="Alt text" width="800"/>
</p>

A2C, in contrast, uses neural networks, so it can handle more complex tasks.

`MC_Reinforce` updates the policy parameters along the policy gradient:

$$ \theta_{t+1} \leftarrow \theta_t + \alpha_t \, G_t \, \nabla_\theta \log \pi_\theta(a_t | s_t) $$

where the return $G_t$ is computed with the Monte Carlo method from the complete episode:

$$ G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t + k + 1} $$

Because $G_t$ sums every reward until the end of the episode, it may lead to high variance during training.
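The Monte Carlo return $G_t$ can be computed for a whole episode with one backward pass over the rewards. The sketch below is illustrative; `discounted_returns` is a hypothetical helper, and `rewards[t]` stands for $R_{t+1}$.

```python
def discounted_returns(rewards, discount):
    """Return [G_0, ..., G_{T-1}] where G_t = sum_k discount^k * R_{t+k+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that G_t = R_{t+1} + discount * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], discount=0.9))
```

Every $G_t$ depends on all later rewards of the same episode, so a single unlucky action late in the episode changes the update signal for every earlier step.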

<p align = "center">
    <img src="result/MC_reward_ep.png" alt="Alt text" width="800"/>
</p>

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <img src="result/MC_lately.png" alt="Alt text" width="800"/>
    <p>MC reward during training</p>
  </div>

  <div style="width: 48%;">
    <img src="result/AC_lately.png" alt="Alt text" width="800"/>
    <p>AC reward during training</p>
  </div>
</div>

As can be seen, AC has less variance in the later episodes of training.
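A common explanation for this lower variance is that the critic lets the actor learn from a bootstrapped one-step estimate instead of the full Monte Carlo return. A minimal sketch, assuming a one-step TD advantage $A_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$; our A2C implementation is not shown here and may compute the advantage differently, and `td_advantage` is a hypothetical helper.

```python
def td_advantage(reward, value, next_value, done, discount=0.9):
    """One-step TD advantage: R + discount * V(S') - V(S)."""
    # No bootstrap from the next state once the episode has ended.
    target = reward + (0.0 if done else discount * next_value)
    return target - value

print(td_advantage(reward=1.0, value=2.0, next_value=3.0, done=False))
print(td_advantage(reward=1.0, value=2.0, next_value=3.0, done=True))
```

The advantage depends on only one sampled reward plus the critic's estimates, so it fluctuates less than a return summed over the whole episode, at the cost of some bias from the critic.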

Moreover, we think that `DQN` and `MC_RL` get stuck in a local optimum. As seen in the videos, the cart can stabilize the pole, but it keeps moving slightly to the left or right until the episode terminates because the cart moves out of the boundary.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/DQN_longest_episode_hi256.mp4" type="video/mp4">
    </video>
    <p>DQN</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/MC_longest_episode.mp4" type="video/mp4">
    </video>
    <p>MC_RL</p>
  </div>
</div>

This may be due to the reward function, which gives a reward simply for keeping the cart-pole alive. As a result, the agent focuses more on surviving as long as possible rather than keeping the pole upright at the center.

```py
# (1) Constant running reward
alive = RewTerm(func=mdp.is_alive, weight=1.0)
# (2) Failure penalty
terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
# (3) Primary task: keep pole upright
pole_pos = RewTerm(
    func=mdp.joint_pos_target_l2,
    weight=-1.0,
    params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
)
```

Currently we use only 3 reward terms: the agent gains reward while the `cart pole` is alive and the pole is upright, and loses reward when the episode terminates. There is no term that rewards the cart for holding a stable position or not moving, so we can add the reward terms below to make both the cart and the pole more stable.

```py
# (4) Shaping tasks: lower cart velocity
cart_vel = RewTerm(
    func=mdp.joint_vel_l1,
    weight=-0.01,
    params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
)
# (5) Shaping tasks: lower pole angular velocity
pole_vel = RewTerm(
    func=mdp.joint_vel_l1,
    weight=-0.005,
    params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
)
```
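As a rough illustration of how the five weighted terms would combine into one step reward, the sketch below uses plain Python instead of the simulator API. It assumes the L2 term is the squared pole angle and the L1 terms are absolute velocities, and it ignores any per-step scaling the framework may apply. `step_reward` is a hypothetical function, not part of the environment.

```python
def step_reward(pole_angle, cart_vel, pole_vel, terminated, use_shaping=True):
    """Weighted sum of the reward terms (1)-(5) for a single step."""
    reward = 1.0 * (0.0 if terminated else 1.0)    # (1) alive
    reward += -2.0 * (1.0 if terminated else 0.0)  # (2) terminating
    reward += -1.0 * pole_angle ** 2               # (3) pole upright
    if use_shaping:
        reward += -0.01 * abs(cart_vel)            # (4) cart velocity
        reward += -0.005 * abs(pole_vel)           # (5) pole angular velocity
    return reward

# Upright pole, but the cart drifts at 2.0 units per second.
print(step_reward(0.0, 2.0, 0.0, terminated=False, use_shaping=False))
print(step_reward(0.0, 2.0, 0.0, terminated=False, use_shaping=True))
```

Without the shaping terms, a drifting cart with an upright pole earns the same reward as a cart at rest; with them, drifting is slightly penalized at every step.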


### **Conclusion**
For every function approximation method, we choose as best the model that keeps the pole upright for the longest alive time with the lowest variance. Low variance means that every time we use the model to play the stabilize task, it should achieve similar rewards across all random environments.

#### **The best algorithm is `Advantage Actor-Critic (A2C)` with a lower action range**

<p align="center">
  <video width="800" controls>
    <source src="result/AC_longest_episode_ar_10_10.mp4" type="video/mp4">
  </video>
</p>