<a href="https://colab.research.google.com/github/dumoura/gofai/blob/main/RL_A3C_Algorithm_Kung_Fu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning and Asynchronous Actor-Critic Agent (A3C)

## Part 0 - Installing the required packages and importing the libraries

In [None]:
from google.colab import data_table

In [None]:
data_table.enable_dataframe_formatter()

### Installing Gymnasium

In [None]:
# @title Press play to install Gymnasium {display-mode: "form"}
# This code will be hidden when the notebook is loaded.
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]

Collecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.29.1
Collecting shimmy[atari]<1.0,>=0.1.0 (from gymnasium[accept-rom-license,atari])
  Downloading Shimmy-0.2.1-py3-none-any.whl (25 kB)
Collecting autorom[accept-rom-license]~=0.4.2 (from gymnasium[accept-rom-license,atari])
  Downloading AutoROM-0.4.2-py3-none-any.whl (16 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.4.2->gymnasium[accept-rom-license,atari])
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)

### Importing the libraries

In [None]:
import cv2
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.multiprocessing as mp
import torch.distributions as distributions
from torch.distributions import Categorical
import gymnasium as gym
from gymnasium import ObservationWrapper
from gymnasium.spaces import Box

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

**Introduction to the Network Class**

The network below processes a stack of input images and produces two outputs: action values (used as the logits of the policy) and a state value.

**Components of the Network**

Convolutional Layers: These layers are used to extract features from the input images.

- conv1: Takes in 4 channels (e.g., 4 stacked grayscale images) and outputs 32 feature maps.
- conv2: Takes in the 32 feature maps from conv1 and outputs another set of 32 feature maps.
- conv3: Further processes the 32 feature maps from conv2.

Flatten Layer: Converts the 2D feature maps into a 1D vector to feed into the fully connected layers.

Fully Connected Layers (FC):

- fc1: Takes the flattened vector and produces 128 features.
- fc2a: Maps the 128 features to action_size outputs (for action values).
- fc2s: Maps the 128 features to a single output (for the state value).

**Forward Method**

The forward method defines the forward pass of the network, i.e., how the input data flows through the network to produce the outputs.

- The input state is passed through the convolutional layers (conv1, conv2, conv3), with ReLU activations applied after each layer.
- The output of the last convolutional layer is flattened.
- The flattened vector is passed through fc1 and a ReLU activation.
- The resulting features are then split into action values (fc2a) and a state value (fc2s).
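
The size of the flattened vector that `fc1` receives can be checked by hand. The sketch below assumes 42x42 input frames (the size produced by the preprocessing wrapper later in the notebook) and no padding; `conv_output_size` is a helper name introduced here for illustration only.

```python
def conv_output_size(size, kernel_size=3, stride=2, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 42  # assumed input height and width
for _ in range(3):  # conv1, conv2, conv3 all use a 3x3 kernel with stride 2
    size = conv_output_size(size)

flattened_features = 32 * size * size  # 32 feature maps of size x size
print(size, flattened_features)  # 4 512
```

Under these assumptions the feature maps shrink from 42 to 20, 9, and finally 4 pixels per side, so the flattened vector has 32 * 4 * 4 = 512 entries.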

In [None]:
class Network(nn.Module):

  def __init__(self, action_size):
    super(Network, self).__init__()

    self.conv1 = torch.nn.Conv2d(in_channels = 4, out_channels = 32, kernel_size = (3,3), stride = 2) # in_channels = 4: the stack of 4 grayscale frames produced by the preprocessing wrapper below
    self.conv2 = torch.nn.Conv2d(in_channels= 32, out_channels = 32, kernel_size = (3,3), stride = 2)
    self.conv3 = torch.nn.Conv2d(in_channels= 32, out_channels = 32, kernel_size = (3,3), stride = 2)

    # create the flatten layer
    self.flatten = torch.nn.Flatten()

    self.fc1 = torch.nn.Linear(512, 128)
    self.fc2a = torch.nn.Linear(128, action_size) # for the policy --> q --> action values
    self.fc2s = torch.nn.Linear(128, 1) # for the value --> c --> state values

  # forward method
  def forward(self, state):
    x = self.conv1(state)
    x = F.relu(x)

    x = self.conv2(x)
    x = F.relu(x)

    x = self.conv3(x)
    x = F.relu(x)

    # flatten the output
    x = self.flatten(x)

    x = self.fc1(x)
    x = F.relu(x)

    action_values = self.fc2a(x)
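    state_value = self.fc2s(x).squeeze(-1) # one value per state, shape (batch_size,); indexing with [0] would keep only the first state's value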

    return action_values, state_value




## Part 2 - Training the AI

### Setting up the environment

This section breaks down the `PreprocessAtari` class and the `make_env` function step by step.

**Explanation**

The `PreprocessAtari` class is a custom environment wrapper for preprocessing Atari game observations before they are fed into a neural network. This preprocessing typically includes resizing, cropping, converting to grayscale, and stacking multiple frames.

#### `__init__` Method
- **Parameters**:
  - `env`: The original environment to be wrapped.
  - `height` and `width`: The dimensions to resize the frames to (default is 42x42).
  - `crop`: A function to crop the image (default is an identity function).
  - `dim_order`: The order of dimensions in the processed image ('pytorch' or 'tensorflow').
  - `color`: Whether to keep the image in color (default is `False`, meaning grayscale).
  - `n_frames`: The number of frames to stack (default is 4).

- **Initialization**:
  - Sets image size, cropping function, dimension order, color option, and number of frames.
  - Determines the number of channels based on whether the images are in color.
  - Sets the observation space dimensions.
  - Initializes a buffer (`self.frames`) to store the stacked frames.

**`reset` Method**
- **Function**:
  - Resets the buffer to zeros.
  - Resets the original environment and updates the buffer with the initial observation.
  - Returns the processed frames and any additional information from the environment.

**`observation` Method**
- **Function**:
  - Crops and resizes the image.
  - Converts the image to grayscale if `color` is `False`.
  - Normalizes the image to have pixel values between 0 and 1.
  - Rolls the buffer to make space for the new frame.
  - Inserts the new frame into the buffer.
  - Returns the updated buffer.
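
The roll-and-insert step can be sketched in isolation. The example below uses tiny 2x2 dummy frames filled with the time step (an assumption for readability; the notebook uses 42x42 grayscale frames) and a buffer of 4 frames.

```python
import numpy as np

# Buffer of 4 stacked frames, channels first, initially all zeros.
frames = np.zeros((4, 2, 2), dtype=np.float32)

for t in range(1, 6):
    new_frame = np.full((2, 2), t, dtype=np.float32)  # dummy frame for step t
    frames = np.roll(frames, shift=-1, axis=0)  # move every frame one slot toward index 0
    frames[-1] = new_frame  # the newest frame overwrites the last slot

print(frames[:, 0, 0])  # [2. 3. 4. 5.]
```

After five steps the buffer holds the four most recent frames, oldest first; the frame from step 1 has been pushed out.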

**`update_buffer` Method**
- **Function**:
  - Updates the frame buffer with the processed observation.

**`make_env` Function**
- **Function**:
  - Creates the original Atari environment.
  - Wraps the environment with the `PreprocessAtari` wrapper.
  - Returns the wrapped environment.

**Example Usage**
- Creates an environment using `make_env()`.
- Prints the shape of the observation space and the number of actions.
- Prints the action names of the environment.

**Code Summary**

This setup preprocesses the images from the Atari environment so that they are ready for input into a neural network. This preprocessing helps in reducing the complexity of the input data while retaining the essential information needed for decision-making.

In [None]:
class PreprocessAtari(ObservationWrapper):

  def __init__(self, env, height = 42, width = 42, crop = lambda img: img, dim_order = 'pytorch', color = False, n_frames = 4):
    super(PreprocessAtari, self).__init__(env)
    self.img_size = (height, width)
    self.crop = crop
    self.dim_order = dim_order
    self.color = color
    self.frame_stack = n_frames
    n_channels = 3 * n_frames if color else n_frames
    obs_shape = {'tensorflow': (height, width, n_channels), 'pytorch': (n_channels, height, width)}[dim_order]
    self.observation_space = Box(0.0, 1.0, obs_shape)
    self.frames = np.zeros(obs_shape, dtype = np.float32)

  def reset(self, **kwargs):
    self.frames = np.zeros_like(self.frames)
    obs, info = self.env.reset(**kwargs) # forward seed/options to the wrapped environment
    self.update_buffer(obs)
    return self.frames, info

  def observation(self, img):
    img = self.crop(img)
    img = cv2.resize(img, (self.img_size[1], self.img_size[0])) # cv2.resize expects (width, height)
    if not self.color:
      if len(img.shape) == 3 and img.shape[2] == 3:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) # the environment returns RGB frames, not BGR
    img = img.astype('float32') / 255.
    if self.color:
      self.frames = np.roll(self.frames, shift = -3, axis = 0)
    else:
      self.frames = np.roll(self.frames, shift = -1, axis = 0)
    if self.color:
      self.frames[-3:] = img
    else:
      self.frames[-1] = img
    return self.frames

  def update_buffer(self, obs):
    self.frames = self.observation(obs)

def make_env():
  env = gym.make("KungFuMasterDeterministic-v0", render_mode = 'rgb_array')
  env = PreprocessAtari(env, height = 42, width = 42, crop = lambda img: img, dim_order = 'pytorch', color = False, n_frames = 4)
  return env

env = make_env()

state_shape = env.observation_space.shape
number_actions = env.action_space.n
print("Observation shape:", state_shape)
print("Number actions:", number_actions)
print("Action names:", env.unwrapped.get_action_meanings())


Observation shape: (4, 42, 42)
Number actions: 14
Action names: ['NOOP', 'UP', 'RIGHT', 'LEFT', 'DOWN', 'DOWNRIGHT', 'DOWNLEFT', 'RIGHTFIRE', 'LEFTFIRE', 'DOWNFIRE', 'UPRIGHTFIRE', 'UPLEFTFIRE', 'DOWNRIGHTFIRE', 'DOWNLEFTFIRE']



### Initializing the hyperparameters

In [None]:
learning_rate = 1e-4
discount_factor = 0.99 # gamma
number_environments = 10 #number of parallel environments

### Implementing the A3C class

In the context of the `Agent` class's `act` method below, a "list of states" refers to a batch of multiple state observations rather than a single state observation. This concept is important for efficiently processing multiple states at once using the neural network, especially during training or evaluation when multiple states can be processed in parallel.

### Explanation

#### Why Use a List of States?
- **Batch Processing**: Processing multiple states in parallel can significantly speed up computations, especially when using a GPU.
- **Training Efficiency**: Here the batch comes from the parallel environments (`number_environments = 10`), each contributing one transition per update. Updating the network on the whole batch at once is more efficient than updating it one state at a time.
- **Consistency**: Ensuring that the network can handle both single states and batches of states makes the code more flexible and easier to extend.

### Example

Let's clarify with a simple example. Assume we have an environment where the state is represented by a 3D array (e.g., an image). If we have a single state, it might look like this:

```python
single_state = np.array([[[0.1, 0.2, 0.3],
                          [0.4, 0.5, 0.6],
                          [0.7, 0.8, 0.9]]])
```

This single state has the shape `(1, 3, 3)`.

Now, if we have a list of states (a batch of multiple states), it might look like this:

```python
list_of_states = np.array([[[[0.1, 0.2, 0.3],
                             [0.4, 0.5, 0.6],
                             [0.7, 0.8, 0.9]]],
                           
                           [[[0.9, 0.8, 0.7],
                             [0.6, 0.5, 0.4],
                             [0.3, 0.2, 0.1]]]])
```

This list of states has the shape `(2, 1, 3, 3)`, where `2` is the number of states in the batch.

### Modifying the `act` Method

The `act` method in the `Agent` class is designed to handle both single states and batches of states. Here’s the relevant part:

```python
def act(self, state):
  if state.ndim == 3:
    state = [state]  # Converts single state to a list of one state

  state = torch.tensor(state, dtype=torch.float32, device=self.device)
  action_values, _ = self.network(state)
  policy = F.softmax(action_values, dim=-1)

  return np.array([np.random.choice(len(p), p=p) for p in policy.detach().cpu().numpy()])
```

- **Checking Dimensionality**: `if state.ndim == 3: state = [state]`
  - If the input state has 3 dimensions, it assumes it is a single state and converts it to a list containing one state, thus creating a batch of size 1.
- **Converting to Tensor**: `state = torch.tensor(state, dtype=torch.float32, device=self.device)`
  - The state (or batch of states) is converted to a PyTorch tensor and moved to the appropriate device (CPU or GPU).
- **Processing through Network**: `action_values, _ = self.network(state)`
  - The batch of states is passed through the neural network to get the action values.
- **Computing Policy**: `policy = F.softmax(action_values, dim=-1)`
  - The softmax function is applied to the action values to get a probability distribution over the actions.
- **Sampling Actions**: `np.array([np.random.choice(len(p), p=p) for p in policy.detach().cpu().numpy()])`
  - Actions are sampled from the policy distribution for each state in the batch.

This allows the agent to handle both single and multiple states seamlessly, making it versatile for various use cases in reinforcement learning.
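
The softmax and sampling steps can be tried without the network. The sketch below is a NumPy stand-in, not the notebook's code: the `softmax` helper, the example logits, and the seeded generator `rng` are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Hypothetical action values for a batch of 2 states and 3 actions.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])

policy = softmax(logits)  # each row is a probability distribution over actions
actions = np.array([rng.choice(len(p), p=p) for p in policy])  # one sampled action per state

print(policy.sum(axis=-1))  # each row sums to 1
print(actions.shape)        # (2,)
```

Equal action values, as in the second row, give a uniform policy, so sampling rather than taking the argmax keeps the agent exploring.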


In [None]:
class Agent():

  def __init__(self, action_values):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.action_size = action_values

    self.network = Network(self.action_size).to(self.device)
    self.optimizer = torch.optim.Adam(self.network.parameters(), lr = learning_rate)


  def act(self, state):
    if state.ndim == 3: # a single observation has 3 dimensions (stacked frames, height, width); wrap it in a list to form a batch of size 1
      state = [state]

    state = torch.tensor(state, dtype = torch.float32, device = self.device) #Converts the state to a PyTorch tensor and moves it to the appropriate device.
    action_values, _ = self.network(state) #Passes the state through the network to get action values.
    policy = F.softmax(action_values, dim = -1) #Applies the softmax function to the action values to get a probability distribution over the actions.
    return np.array([np.random.choice(len(p), p = p) for p in policy.detach().cpu().numpy()]) #Samples actions from the policy distribution for each state in the batch. Returns an array of actions sampled from the policy distribution.

  def step(self, state, action, reward, next_state, done):
    batch_size = state.shape[0] #number of states in the batch
    state = torch.tensor(state, dtype = torch.float32, device = self.device)
    next_state = torch.tensor(next_state, dtype = torch.float32, device = self.device)
    reward = torch.tensor(reward, dtype = torch.float32, device = self.device)
    done = torch.tensor(done, dtype = torch.bool, device = self.device).to(dtype= torch.float32) #True or False --> dtype bool -> then, float

    action_values, state_value = self.network(state)
    _, next_state_value = self.network(next_state)

    target_state_value = reward + discount_factor * next_state_value * (1 - done) #Bellman equation
    advantage = target_state_value - state_value

    #Critical part --> compute losses
    probs = F.softmax(action_values, dim = -1)
    logprobs = F.log_softmax(action_values, dim = -1)
    entropy = -torch.sum(probs * logprobs, axis = -1)
    batch_idx = np.arange(batch_size)
    logp_actions = logprobs[batch_idx, action]

    actor_loss = - (logp_actions * advantage.detach()).mean() - 0.001 * entropy.mean()
    critic_loss = F.mse_loss(target_state_value.detach(), state_value)
    total_loss = actor_loss + critic_loss

    self.optimizer.zero_grad()
    total_loss.backward()
    self.optimizer.step()


The following walkthrough breaks the `step` method down step by step.

### Explanation

The `step` method is part of the `Agent` class and is used for updating the neural network based on the agent's experience (state, action, reward, next state, and whether the episode is done).

#### `step` Method
- **Parameters**:
  - `state`: The current state of the environment.
  - `action`: The action taken by the agent.
  - `reward`: The reward received after taking the action.
  - `next_state`: The state of the environment after taking the action.
  - `done`: A boolean indicating whether the episode has ended.

#### Processing the Input
1. **Batch Size**:
    - `batch_size = state.shape[0]`
    - Determines the number of samples in the batch.

2. **Convert to Tensors**:
    - Converts the input data (state, next_state, reward, done) into PyTorch tensors and moves them to the appropriate device (CPU or GPU).

```python
state = torch.tensor(state, dtype=torch.float32, device=self.device)
next_state = torch.tensor(next_state, dtype=torch.float32, device=self.device)
reward = torch.tensor(reward, dtype=torch.float32, device=self.device)
done = torch.tensor(done, dtype=torch.bool, device=self.device).to(dtype=torch.float32)  # True or False to float
```

#### Forward Pass
3. **Compute Action and State Values**:
    - Passes the current state and next state through the network to get action values and state values.

```python
action_values, state_value = self.network(state)
_, next_state_value = self.network(next_state)
```

#### Compute Target Values
4. **Bellman Equation**:
    - Calculates the target state value using the Bellman equation.

```python
target_state_value = reward + discount_factor * next_state_value * (1 - done)
advantage = target_state_value - state_value
```
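
A worked example with made-up numbers may help. The values below are assumptions chosen for easy arithmetic, with a batch of two transitions where the second one ends its episode.

```python
import numpy as np

discount_factor = 0.99

# Hypothetical batch of two transitions.
reward = np.array([1.0, 0.5], dtype=np.float32)
state_value = np.array([2.5, 0.2], dtype=np.float32)       # critic's estimate for the current states
next_state_value = np.array([2.0, 3.0], dtype=np.float32)  # critic's estimate for the next states
done = np.array([0.0, 1.0], dtype=np.float32)              # second transition is terminal

# (1 - done) removes the bootstrapped value when the episode has ended.
target_state_value = reward + discount_factor * next_state_value * (1 - done)
advantage = target_state_value - state_value

print(target_state_value)  # approximately [2.98 0.5]
print(advantage)           # approximately [0.48 0.3]
```

For the terminal transition the target is just the reward, because no future value follows the end of an episode.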

#### Compute Losses
5. **Policy Loss (Actor Loss)**:
    - Computes the policy probabilities and log probabilities.
    - Calculates the entropy for exploration.
    - Computes the log probability of the taken action.
    - Computes the actor loss.

```python
probs = F.softmax(action_values, dim=-1)
logprobs = F.log_softmax(action_values, dim=-1)
entropy = -torch.sum(probs * logprobs, axis=-1)
batch_idx = np.arange(batch_size)
logp_action = logprobs[batch_idx, action]

actor_loss = - (logp_action * advantage.detach()).mean() - 0.001 * entropy.mean()
```

6. **Value Loss (Critic Loss)**:
    - Computes the mean squared error loss between the target state value and the predicted state value.

```python
critic_loss = F.mse_loss(target_state_value.detach(), state_value)
```

7. **Total Loss**:
    - Adds the actor loss and critic loss to get the total loss.

```python
total_loss = actor_loss + critic_loss
```
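
Written as formulas, the losses above take the following form. The notation is introduced here for illustration and is not taken from the notebook: \( N \) is the batch size, \( \pi \) the softmax policy, \( V \) the critic output, \( y_i = r_i + \gamma V(s'_i)(1 - d_i) \) the target state value, and \( A_i = y_i - V(s_i) \) the advantage. Both \( y_i \) and \( A_i \) are treated as constants during backpropagation, which is what `.detach()` does in the code.

\[ H\big(\pi(\cdot \mid s_i)\big) = -\sum_{a} \pi(a \mid s_i) \log \pi(a \mid s_i) \]

\[ L_{\text{actor}} = -\frac{1}{N} \sum_{i=1}^{N} \log \pi(a_i \mid s_i) \, A_i \; - \; 0.001 \cdot \frac{1}{N} \sum_{i=1}^{N} H\big(\pi(\cdot \mid s_i)\big) \]

\[ L_{\text{critic}} = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - V(s_i) \big)^2 \]

\[ L_{\text{total}} = L_{\text{actor}} + L_{\text{critic}} \]

The entropy term enters with a negative sign, so minimizing the actor loss pushes the entropy up and discourages the policy from collapsing onto a single action too early.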

#### Backpropagation
8. **Optimize the Network**:
    - Resets the gradients, performs backpropagation, and updates the network parameters.

```python
self.optimizer.zero_grad()
total_loss.backward()
self.optimizer.step()
```

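#### Return Values
9. **Return**:
    - The `step` method defined in the `Agent` cell above has no `return` statement, since its effect is the in-place update of the network parameters. The variant in the summary below optionally returns the converted state, next state, reward, action, and done flag; the training loop does not use them.

```python
return state, next_state, reward, action, done
```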

### Code Summary
```python
def step(self, state, action, reward, next_state, done):
  batch_size = state.shape[0]

  state = torch.tensor(state, dtype=torch.float32, device=self.device)
  next_state = torch.tensor(next_state, dtype=torch.float32, device=self.device)
  reward = torch.tensor(reward, dtype=torch.float32, device=self.device)
  done = torch.tensor(done, dtype=torch.bool, device=self.device).to(dtype=torch.float32)

  action_values, state_value = self.network(state)
  _, next_state_value = self.network(next_state)

  target_state_value = reward + discount_factor * next_state_value * (1 - done)
  advantage = target_state_value - state_value

  probs = F.softmax(action_values, dim=-1)
  logprobs = F.log_softmax(action_values, dim=-1)
  entropy = -torch.sum(probs * logprobs, axis=-1)
  batch_idx = np.arange(batch_size)
  logp_action = logprobs[batch_idx, action]

  actor_loss = - (logp_action * advantage.detach()).mean() - 0.001 * entropy.mean()
  critic_loss = F.mse_loss(target_state_value.detach(), state_value)
  total_loss = actor_loss + critic_loss

  self.optimizer.zero_grad()
  total_loss.backward()
  self.optimizer.step()

  return state, next_state, reward, action, done
```

### Simplified Explanation
1. **Input Conversion**: Converts the input data to tensors.
2. **Forward Pass**: Passes the current and next states through the network.
3. **Compute Target Values**: Uses the Bellman equation to compute target values.
4. **Compute Losses**: Calculates the actor and critic losses.
5. **Optimize Network**: Updates the network using backpropagation.
6. **Return**: Optionally returns the converted input data; the training loop does not use it.

This method allows the agent to learn from its experiences by updating the neural network based on the actions taken, the rewards received, and the resulting states.

### Initializing the A3C agent

In [None]:
agent = Agent(number_actions)

In [None]:
agent.network

Network(
  (conv1): Conv2d(4, 32, kernel_size=(3, 3), stride=(2, 2))
  (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2))
  (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2))
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (fc1): Linear(in_features=512, out_features=128, bias=True)
  (fc2a): Linear(in_features=128, out_features=14, bias=True)
  (fc2s): Linear(in_features=128, out_features=1, bias=True)
)

The output above summarizes the neural network architecture as printed by PyTorch. Here’s a breakdown of each layer and its components to help you understand the structure of the `Network` class.

### Neural Network Architecture Breakdown

1. **Convolutional Layers**:
    - **conv1**:
        - `Conv2d(4, 32, kernel_size=(3, 3), stride=(2, 2))`
        - This layer takes 4 input channels (like a stack of 4 grayscale images) and outputs 32 feature maps. The kernel size is 3x3, and it moves with a stride of 2 pixels.
    - **conv2**:
        - `Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2))`
        - This layer takes 32 input channels and outputs 32 feature maps. The kernel size is 3x3, and the stride is 2.
    - **conv3**:
        - `Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2))`
        - This layer also takes 32 input channels and outputs 32 feature maps. The kernel size is 3x3, and the stride is 2.

2. **Flatten Layer**:
    - **flatten**:
        - `Flatten(start_dim=1, end_dim=-1)`
        - This layer flattens every dimension except the batch dimension (dimension 0), converting each sample's 3D feature maps into a 1D vector.

3. **Fully Connected Layers**:
    - **fc1**:
        - `Linear(in_features=512, out_features=128, bias=True)`
        - This fully connected layer takes 512 input features (after flattening) and outputs 128 features.
    - **fc2a**:
        - `Linear(in_features=128, out_features=14, bias=True)`
        - This fully connected layer takes 128 input features and outputs 14 features, representing the action values (assuming there are 14 possible actions).
    - **fc2s**:
        - `Linear(in_features=128, out_features=1, bias=True)`
        - This fully connected layer takes 128 input features and outputs a single value, representing the state value.
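
The number of trainable parameters follows directly from the layer shapes printed above. The sketch below counts them in plain Python; `conv2d_params` and `linear_params` are helper names introduced here for illustration, and every layer is assumed to include a bias term, as PyTorch does by default.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights of a square-kernel Conv2d layer, plus one bias per output channel."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

def linear_params(in_features, out_features, bias=True):
    """Weights of a Linear layer, plus one bias per output feature."""
    return in_features * out_features + (out_features if bias else 0)

layer_params = {
    "conv1": conv2d_params(4, 32, 3),
    "conv2": conv2d_params(32, 32, 3),
    "conv3": conv2d_params(32, 32, 3),
    "fc1": linear_params(512, 128),
    "fc2a": linear_params(128, 14),
    "fc2s": linear_params(128, 1),
}
total_params = sum(layer_params.values())

print(layer_params)
print(total_params)  # 87279
```

Most of the parameters sit in `fc1`, which connects the 512 flattened features to 128 hidden units.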

In the context of neural networks, the term `bias` refers to an additional parameter in each neuron that allows the model to fit the data better. The `bias` is used to shift the activation function to the left or right, which can help the model make more accurate predictions.

### Explanation of `bias=True`

When `bias=True` is specified in a layer (like in PyTorch's `Linear` or `Conv2d` layers), it means that each neuron in that layer has an associated bias parameter that will be learned during training.

#### How Bias Works

Consider a simple neural network layer, such as a fully connected (linear) layer, mathematically described as:

\[ y = Wx + b \]

- **W**: Weights matrix
- **x**: Input vector
- **b**: Bias vector
- **y**: Output vector

For each neuron in the layer:
- The input vector \( x \) is multiplied by the weights matrix \( W \).
- The bias vector \( b \) is added to the resulting product.

### Example

#### Without Bias
If `bias=False`, the computation would be:

\[ y = Wx \]

In this case, the network relies solely on the weighted sum of the inputs, so the output is forced to be zero whenever the input is zero. This might not be sufficient to accurately model complex patterns in the data.

#### With Bias
If `bias=True`, the computation is:

\[ y = Wx + b \]

Here, the bias \( b \) allows each neuron to have an additional degree of freedom. This can help the network learn patterns more effectively because it can adjust the output of each neuron independently of the input.
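
A small numeric sketch makes the difference concrete. The weights, bias, and the `linear` helper below are made up for illustration; the helper is a plain-Python stand-in for a linear layer, not PyTorch's implementation.

```python
def linear(weights, inputs, bias=None):
    """Compute y = Wx (+ b) for a weight matrix given as a list of rows."""
    outputs = []
    for row_index, row in enumerate(weights):
        value = sum(w * x for w, x in zip(row, inputs))
        if bias is not None:
            value += bias[row_index]
        outputs.append(value)
    return outputs

W = [[0.5, -1.0],
     [2.0, 0.25]]  # hypothetical weights: 2 neurons, 2 inputs
b = [0.1, -0.3]    # hypothetical bias, one value per neuron
zero_input = [0.0, 0.0]

without_bias = linear(W, zero_input)
with_bias = linear(W, zero_input, b)

print(without_bias)  # [0.0, 0.0]
print(with_bias)     # [0.1, -0.3]
```

With a zero input the weights have no effect, so only the bias lets the two neurons produce different, non-zero outputs.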

### Why Bias is Important

1. **Flexibility**: The bias allows the activation function to be shifted, providing more flexibility to the model.
2. **Better Learning**: It helps the model to better fit the data, especially in cases where the data points do not pass through the origin.
3. **Improved Performance**: Bias can improve the performance of the neural network by allowing each neuron to learn more complex patterns.


### Summary

- **Bias Parameter**: An additional parameter in each neuron.
- **Function**: Shifts the activation function, providing more flexibility.
- **Benefit**: Helps the network learn more complex patterns and fit the data better.
- **Usage**: Specified in layers like `Linear` or `Conv2d` by setting `bias=True`.

Including bias terms generally helps improve the neural network's learning capability and overall performance.

### Layer-by-Layer Summary

- **Input**: The network starts with an input consisting of 4 channels (stacked grayscale images).
- **Conv1**: The first convolutional layer extracts 32 features from the input using 3x3 filters with a stride of 2.
- **Conv2**: The second convolutional layer further processes the 32 feature maps, maintaining the same number of output channels.
- **Conv3**: The third convolutional layer continues to process the feature maps, maintaining the same number of output channels.
- **Flatten**: The flatten layer converts the 3D feature maps into a 1D vector to prepare for the fully connected layers.
- **FC1**: The first fully connected layer reduces the dimensionality from 512 to 128 features.
- **FC2a**: The second fully connected layer (for action values) outputs 14 features, each representing a possible action.
- **FC2s**: The third fully connected layer (for state value) outputs a single value, representing the state value.

This architecture processes the input images through a series of convolutional layers to extract features, flattens the features, and then passes them through fully connected layers to produce action values and state values. The action values help the agent decide which action to take, while the state value helps estimate the value of the current state.

### Evaluating our A3C agent on a single episode

In [None]:
def evaluate(agent, env, n_episodes = 1):
  episodes_rewards = []

  for _ in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0

    while True:
      action = agent.act(state)
      state, reward, terminated, truncated, _ = env.step(action[0]) # Gymnasium returns (obs, reward, terminated, truncated, info)
      total_reward += reward

      if terminated or truncated:
        break

    episodes_rewards.append(total_reward)
  return episodes_rewards

### Running one agent on multiple environments at the same time

In [None]:
class EnvBatch:

  def __init__(self, n_envs = 10): # creates n_envs environments (10 by default)
    self.envs = [make_env() for _ in range(n_envs)]

  def reset(self):
    _states = []

    for env in self.envs:
      _states.append(env.reset()[0]) # reset() returns (state, info); keep only the state
    return np.array(_states)

  def step(self, actions):
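    # Each env.step returns (obs, reward, terminated, truncated, info); zip(*...) regroups the results by field.
    next_states, rewards, terminateds, truncateds, infos = map(np.array, zip(*[env.step(a) for env, a in zip(self.envs, actions)]))
    dones = np.logical_or(terminateds, truncateds)

    for i in range(len(self.envs)):
      if dones[i]:
        next_states[i] = self.envs[i].reset()[0] # restart finished environments right away

    return next_states, rewards, dones, infos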
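
The batching pattern in `EnvBatch.step` can be tried without Atari. The sketch below uses a hypothetical `ToyEnv` class (not part of the notebook or of Gymnasium) that only mimics the five-value return of `step`, and a standalone `batch_step` function written for illustration.

```python
class ToyEnv:
    """Hypothetical stand-in for an environment: the state is just a step counter."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.episode_length
        # Same ordering as Gymnasium: (obs, reward, terminated, truncated, info).
        return self.t, float(action), terminated, False, {}


def batch_step(envs, actions):
    results = [env.step(a) for env, a in zip(envs, actions)]
    # zip(*results) turns a list of per-environment tuples into one sequence per field.
    states, rewards, terminated, truncated, infos = map(list, zip(*results))
    dones = [te or tr for te, tr in zip(terminated, truncated)]
    for i, env in enumerate(envs):
        if dones[i]:
            states[i] = env.reset()[0]  # finished environments restart immediately
    return states, rewards, dones, infos


envs = [ToyEnv(episode_length=2), ToyEnv(episode_length=3)]
for env in envs:
    env.reset()

first = batch_step(envs, [1, 0])
second = batch_step(envs, [1, 0])
print(first[0], first[2])    # [1, 1] [False, False]
print(second[0], second[2])  # [0, 2] [True, False]
```

On the second call the first environment finishes, so its returned state is already the initial state of a new episode while the done flag still marks the transition as terminal.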



### Training the A3C agent

In [None]:
import tqdm

env_batch = EnvBatch(number_environments)
batch_states = env_batch.reset()

with tqdm.trange(0, 3001) as progress_bar: # 3001 iterations (0 through 3000)

  for i in progress_bar: # one batched environment step and one network update per iteration

    batch_actions = agent.act(batch_states)
    batch_next_states, batch_rewards, batch_dones, _ = env_batch.step(batch_actions)
    batch_rewards *= 0.001
    agent.step(batch_states, batch_actions, batch_rewards, batch_next_states, batch_dones)
    batch_states = batch_next_states

    if i % 1000 == 0:

      print("Average agent reward: ", np.mean(evaluate(agent, env, n_episodes = 10)))

  0%|          | 8/3001 [00:33<2:31:11,  3.03s/it] 

Average agent reward:  550.0


 34%|███▎      | 1008/3001 [01:26<38:10,  1.15s/it]

Average agent reward:  1030.0


 67%|██████▋   | 2008/3001 [02:19<18:03,  1.09s/it]

Average agent reward:  1070.0


100%|██████████| 3001/3001 [03:12<00:00, 15.58it/s]

Average agent reward:  850.0





## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env):
  state, _ = env.reset()
  done = False
  frames = []
  while not done:
    frame = env.render()
    frames.append(frame)
    action = agent.act(state)
    state, reward, terminated, truncated, _ = env.step(action[0])
    done = terminated or truncated
  env.close()
  imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, env)

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

