To apply **Actor-Critic** methods to your project, you can combine the strengths of **policy-gradient methods** (actor) and **value-based methods** (critic). In an Actor-Critic framework, you have two neural networks:

1. **Actor**: Learns a policy and decides which action to take based on the current state.
2. **Critic**: Evaluates the value of the current state (or state-action pair) and provides feedback to improve the actor's performance.

Here’s a step-by-step guide on how to apply Actor-Critic to your project:

### Steps to Apply Actor-Critic:

1. **Define the Actor Network**: Outputs a probability distribution over actions (like in REINFORCE).
2. **Define the Critic Network**: Outputs a value for the current state (or state-action pair).
3. **Interaction with the Environment**: Use the actor to sample actions, and the critic to estimate the value of states.
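4. **Update Both Networks**: The actor is updated using the policy gradient, and the critic is updated to minimize the error between its predicted value and a target value: the observed return in the code below, or a bootstrapped one-step target (the TD error) in online variants.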
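
The TD error mentioned in step 4 can be made concrete with a small sketch. The helper name `td_error` and the numbers are illustrative assumptions, not part of the project code; the update code later in this guide uses full-episode returns rather than this one-step target.

```python
def td_error(reward, value, next_value, done, gamma=0.99):
    """One-step TD error: r + gamma * V(s') - V(s).

    The bootstrap term is dropped when the episode has ended.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value

# Made-up numbers: the critic predicted 0.5 for the current state.
delta = td_error(reward=1.0, value=0.5, next_value=2.0, done=False, gamma=0.9)
terminal_delta = td_error(reward=1.0, value=0.5, next_value=2.0, done=True, gamma=0.9)
print(delta, terminal_delta)
```

A positive TD error means the step turned out better than the critic expected, so the sampled action should become more likely.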

### 1. Define the Actor-Critic Networks

You will need two neural networks: one for the **Actor** and one for the **Critic**. Both networks can be simple fully connected networks.

#### Actor Network (Policy):
The actor outputs the probability distribution over actions.



In [5]:
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(ActorNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, output_size)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.softmax(self.fc2(x), dim=-1)  # Output a probability distribution
        return x
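
A quick sanity check can confirm that the actor returns a valid probability distribution. This is a sketch: the input size of 25 assumes a flattened 5x5 FOV, and the class is repeated only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.fc2(x), dim=-1)

actor = ActorNetwork(input_size=25, output_size=4)  # 25 = assumed flattened 5x5 FOV
probs = actor(torch.zeros(25))                      # dummy all-zero observation

print(probs.shape)         # one probability per action
print(probs.sum().item())  # should be very close to 1.0
```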



#### Critic Network (Value):
The critic outputs a value (scalar) for the given state.



In [6]:
class CriticNetwork(nn.Module):
    def __init__(self, input_size):
        super(CriticNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, 1)  # Outputs a single value
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        value = self.fc2(x)  # Outputs state value
        return value
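
The shape of the critic's output matters later, because the update step concatenates the per-step values into one vector. The sketch below shows what to expect; the input size of 25 is an assumption (a flattened 5x5 FOV), and the class is repeated so the snippet is self-contained.

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

critic = CriticNetwork(input_size=25)
single = critic(torch.zeros(25))    # one flattened state -> shape (1,)
batch = critic(torch.zeros(8, 25))  # batch of 8 states   -> shape (8, 1)

# Concatenating three single-state outputs gives a vector of shape (3,).
per_step = torch.cat([critic(torch.zeros(25)) for _ in range(3)])
print(single.shape, batch.shape, per_step.shape)
```

For batched input, `squeeze(-1)` turns the `(8, 1)` output into a vector that lines up with a vector of returns.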



### 2. Select Actions and Get Value Estimates

You need to use both the actor (to sample actions) and the critic (to estimate the value of the current state). Here's an example:


In [7]:
def select_action(actor, state):
    state = torch.tensor(state, dtype=torch.float32).flatten()  # Flatten the state (FOV)
    action_probs = actor(state)
    action_distribution = torch.distributions.Categorical(action_probs)
    action = action_distribution.sample()  # Sample an action
    return action.item(), action_distribution.log_prob(action)

def get_value(critic, state):
    state = torch.tensor(state, dtype=torch.float32).flatten()  # Flatten the state
    return critic(state)
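
To see what `select_action` returns, it can be called on a made-up observation. The 5x5 zero grid and the `nn.Sequential` stand-in actor are assumptions for illustration; the function is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Stand-in actor: 25 inputs (assumed flattened 5x5 FOV), 4 discrete actions.
actor = nn.Sequential(
    nn.Linear(25, 128), nn.ReLU(), nn.Linear(128, 4), nn.Softmax(dim=-1)
)

def select_action(actor, state):
    state = torch.tensor(state, dtype=torch.float32).flatten()
    dist = torch.distributions.Categorical(actor(state))
    action = dist.sample()
    return action.item(), dist.log_prob(action)

fov = [[0.0] * 5 for _ in range(5)]  # hypothetical 5x5 observation
action, log_prob = select_action(actor, fov)
print(action, log_prob)
```

The returned log-probability keeps its gradient history, which is what lets the actor loss backpropagate through it later.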



### 3. Collect Trajectories and Compute Returns

You’ll need to interact with the environment and collect states, actions, rewards, and value estimates:


In [8]:
def collect_trajectory(env, actor, critic, max_steps=1000):
    log_probs = []
    values = []
    rewards = []
    
    state = env.reset()
    for step in range(max_steps):
        action, log_prob = select_action(actor, state)
        value = get_value(critic, state)
        
        next_state, reward, done, _ = env.step(action)
        
        log_probs.append(log_prob)
        values.append(value)
        rewards.append(reward)
        
        state = next_state
        if done:
            break
    
    return log_probs, values, rewards
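
`collect_trajectory` relies on the environment exposing `reset()` and a `step(action)` that returns four values. The sketch below runs the same loop against a hypothetical `ToyEnv` (not the project's environment) to show what comes back; the stand-in networks and sizes are also assumptions.

```python
import torch
import torch.nn as nn

class ToyEnv:
    """Hypothetical stand-in: 5x5 zero observation, reward 1 per step, ends after 10 steps."""
    def reset(self):
        self.t = 0
        return [[0.0] * 5 for _ in range(5)]

    def step(self, action):
        self.t += 1
        return [[0.0] * 5 for _ in range(5)], 1.0, self.t >= 10, {}

actor = nn.Sequential(nn.Linear(25, 32), nn.ReLU(), nn.Linear(32, 4), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(25, 32), nn.ReLU(), nn.Linear(32, 1))

def collect_trajectory(env, actor, critic, max_steps=1000):
    log_probs, values, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        x = torch.tensor(state, dtype=torch.float32).flatten()
        dist = torch.distributions.Categorical(actor(x))
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))  # scalar tensor, keeps gradients
        values.append(critic(x))                 # tensor of shape (1,)
        rewards.append(reward)                   # plain Python float
        if done:
            break
    return log_probs, values, rewards

log_probs, values, rewards = collect_trajectory(ToyEnv(), actor, critic)
print(len(log_probs), len(values), len(rewards))
```

All three lists have one entry per environment step, so they can be zipped together in the update.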



### 4. Compute Advantage and Update Actor-Critic

In the Actor-Critic method, the **advantage** is the difference between the actual (discounted) return and the value estimated by the critic. You’ll use this advantage to update the actor, and the same error to train the critic:

- **Advantage**: `A(s, a) = R - V(s)` (where `R` is the actual return, and `V(s)` is the critic's estimate of the state value).
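
Written out per time step, with discount factor `gamma` and an episode of `T` steps indexed from 0 (notation introduced here for clarity), the return and advantage are:

```latex
R_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k} = r_t + \gamma R_{t+1}, \qquad R_T = 0,
\qquad
A(s_t, a_t) = R_t - V(s_t)
```

The recursive form is what makes it convenient to compute returns by walking the reward list backwards.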

You can compute the **returns** and **advantages** like this:


In [9]:
def compute_returns(rewards, gamma=0.99):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

def update_actor_critic(actor, critic, optimizer_actor, optimizer_critic, log_probs, values, rewards, gamma=0.99):
    returns = compute_returns(rewards, gamma)
    returns = torch.tensor(returns, dtype=torch.float32)
    
    values = torch.cat(values)
    
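    # Advantage: observed return minus the critic's estimate.
    # Detach it so actor_loss.backward() does not backpropagate into the critic
    # (and free the graph that critic_loss.backward() still needs).
    advantages = (returns - values).detach()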
    
    # Actor loss
    actor_loss = 0
    for log_prob, advantage in zip(log_probs, advantages):
        actor_loss -= log_prob * advantage  # Policy gradient update
    
    # Critic loss (mean squared error between predicted value and return)
    critic_loss = torch.mean((returns - values) ** 2)
    
    # Update actor
    optimizer_actor.zero_grad()
    actor_loss.backward()
    optimizer_actor.step()
    
    # Update critic
    optimizer_critic.zero_grad()
    critic_loss.backward()
    optimizer_critic.step()



- **Actor Loss**: The actor loss is computed based on the advantage (i.e., how much better or worse the action was compared to the expected value).
- **Critic Loss**: The critic is updated by minimizing the difference between the actual return and the value estimate.
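
A worked example with made-up numbers can make the two losses concrete. This sketch assumes a three-step episode with rewards `[1, 0, 2]`, `gamma=0.5`, and hand-picked critic values and log-probabilities. It computes the actor loss in vectorized form and detaches the advantage so that the actor loss does not send gradients into the critic.

```python
import torch

def compute_returns(rewards, gamma=0.99):
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

returns = torch.tensor(compute_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
values = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)           # pretend critic outputs
log_probs = torch.tensor([-1.0, -2.0, -0.5], requires_grad=True)     # pretend actor log-probs

advantages = (returns - values).detach()           # [0.5, 0.0, 1.0], treated as constants
actor_loss = -(log_probs * advantages).sum()       # -(-0.5 + 0.0 - 0.5) = 1.0
critic_loss = torch.mean((returns - values) ** 2)  # (0.25 + 0.0 + 1.0) / 3

actor_loss.backward()   # gradients reach log_probs only
critic_loss.backward()  # gradients reach values only
print(actor_loss.item(), critic_loss.item())
```

Steps with a positive advantage push their action's log-probability up, and the step with zero advantage contributes nothing to the actor update.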


### 5. Training Loop

Now integrate everything into the training loop:


In [11]:

from dummy_gym import DummyGym
env = DummyGym(init_pos=(2, 3), map_size=(30, 30), num_of_obstacles=140, FOV=(5, 5))

# Define the networks and optimizers.
# nn.Linear needs an integer input size, so measure the flattened length of one observation.
state_size = torch.tensor(env.reset(), dtype=torch.float32).numel()
actor = ActorNetwork(input_size=state_size, output_size=4)
critic = CriticNetwork(input_size=state_size)

optimizer_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
optimizer_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(1000):
    log_probs, values, rewards = collect_trajectory(env, actor, critic)
    update_actor_critic(actor, critic, optimizer_actor, optimizer_critic, log_probs, values, rewards)
    
    if episode % 100 == 0:
        print(f"Episode {episode} complete.")


### Key Concepts Recap:
- **Actor Network**: Outputs the probability distribution over actions.
- **Critic Network**: Outputs a value estimate for the current state.
- **Advantage Calculation**: The advantage is the difference between the actual return and the value estimate, used to guide the policy updates.
- **Updates**: The actor is updated using the policy gradient, while the critic is updated by minimizing the value estimation error.

### Benefits of Actor-Critic:
- **Stabilization**: The critic helps stabilize the learning process by providing better estimates of the expected return, reducing variance in policy updates.
- **Continuous Actions**: Actor-Critic methods can be extended to handle continuous action spaces and more complex tasks.
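
For continuous actions, one common approach is to have the actor output the parameters of a Gaussian instead of a softmax over discrete actions. The sketch below is an illustration under assumptions, not part of the project code: the class name `GaussianActor`, the 2-dimensional action, and the state-independent learned log standard deviation are all hypothetical choices.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Hypothetical actor for continuous actions: outputs a Normal distribution."""
    def __init__(self, input_size, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.mean = nn.Linear(128, action_dim)
        # Learned log standard deviation, shared across states (an assumed design choice).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

actor = GaussianActor(input_size=25, action_dim=2)
dist = actor(torch.zeros(25))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)  # sum over independent action dimensions
print(action.shape, log_prob)
```

The rest of the update is unchanged: `log_prob` takes the place of the categorical log-probability in the actor loss.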

This framework can help your robot learn more efficiently by combining direct policy optimization (the actor) with the critic's value estimates, which reduce the variance of the policy updates.