# Breakout: A performance comparison between a FeedForward and a Convolutional Neural Network

<div style="text-align:center;">
    <img src="./recordings/ATARI_Breakout_Eval_model_21700_reward_357.gif" style="width:30%; height:auto;">
</div>

<div style="display:flex; justify-content:center;">
    <img src="./pictures/Screenshot 2024-02-26 193156.png" style="width:30%; height:auto; margin-right:20px;">
    <img src="./pictures/Screenshot 2024-02-26 193417.png" style="width:10%; height:auto;">
</div>


In [None]:
from breakout_wrapper import make_atari_breakout, wrap
import gym
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import csv
import time
import pickle
import gzip
import os
import signal
import sys
import datetime

## Feed-forward Neural Network

- Input Layer: The model takes an input of shape (84, 84, 4), representing a stack of four 84x84 grayscale frames. This allows the model to consider temporal information over a sequence of frames.

- Flatten Layer: The input is flattened into a one-dimensional vector to be processed by fully connected layers.

- Dense Layers: Two dense (fully connected) layers with 64 units each and ReLU activation functions are applied successively.

- Output Layer: The final dense layer produces an output vector with a length equal to the number of actions (4 in our case). We used a linear activation function to output Q-values for each action, representing the expected future rewards for taking each action from the current state.
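
To get a feel for the size of this network, the parameter count can be worked out by hand from the layer sizes listed above. The sketch below does the arithmetic in plain Python; the helper name `dense_params` is ours and is not part of the notebook.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

flat_inputs = 84 * 84 * 4  # flattened stack of four 84x84 frames
layer_sizes = [flat_inputs, 64, 64, 4]  # input, two hidden layers, 4 actions

# Parameters of each Dense layer, in order
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Almost all of the weights sit in the first dense layer, which connects every input pixel to every hidden unit.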

In [None]:
num_actions = 4
def create_q_model():
   
   inputs = layers.Input(shape=(84, 84, 4))

   flattened = layers.Flatten()(inputs) 

   dense1 = layers.Dense(64, activation="relu")(flattened)
   dense2 = layers.Dense(64, activation="relu")(dense1)
   output_layer = layers.Dense(num_actions, activation="linear")(dense2)

   return keras.Model(inputs=inputs, outputs=output_layer)

## Convolutional Neural Network

The network architecture follows the DeepMind paper and is specifically
tailored for training on Atari 2600 games. Convolutional layers are used to capture
spatial dependencies and patterns in the game frames. This is crucial for Atari
Breakout because it involves complex visual information, and convolutional layers
are effective at learning hierarchical features.

A plain fully connected (dense) network did not process Atari Breakout frames
as effectively, as the results later in this notebook show.

Model Architecture:

![](./pictures/Network.png)
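
The spatial size of each feature map follows from the kernel size and the stride. The sketch below assumes Keras' default `padding="valid"` (no padding), since the `Conv2D` calls in this notebook do not set `padding`; the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

size = 84  # input frames are 84x84
shapes = []
for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
    size = conv_out(size, kernel, stride)
    shapes.append((size, size, filters))

# Number of values handed to the 512-unit dense layer after flattening
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flattened)
```

Under these assumptions the dense layer receives a 3136-dimensional vector, compared with the 28224 raw inputs seen by the first dense layer of the feed-forward model.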

In [None]:
num_actions = 4
def create_q_model():
    # Network defined by the Deepmind paper
    inputs = layers.Input(shape=(84, 84, 4,))

    # Define the first convolutional layer
    # - 32 filters, each 8x8 in size
    # - Stride of 4, meaning the filter moves 4 pixels at a time
    # - ReLU activation function is applied to the output
    layer1 = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)

    # Define the second convolutional layer
    # - 64 filters, each 4x4 in size
    # - Stride of 2
    # - ReLU activation function
    layer2 = layers.Conv2D(64, 4, strides=2, activation="relu")(layer1)

    # Define the third convolutional layer
    # - 64 filters, each 3x3 in size
    # - Stride of 1
    # - ReLU activation function
    layer3 = layers.Conv2D(64, 3, strides=1, activation="relu")(layer2)

    # Flatten the output from the convolutional layers
    layer4 = layers.Flatten()(layer3)

    # Define a fully connected layer with 512 neurons
    # - ReLU activation function
    layer5 = layers.Dense(512, activation="relu")(layer4)

    # Output layer with num_actions neurons (4 in this case for the Breakout game)
    # - Linear activation function
    action = layers.Dense(num_actions, activation="linear")(layer5)


    return keras.Model(inputs=inputs, outputs=action)


# Setup

### Exploration-exploitation trade-off

As we know, in reinforcement learning, agents face the dilemma of whether to explore new actions or exploit known ones to maximize rewards. The exploration-exploitation trade-off is crucial for balancing between discovering potentially better actions and exploiting known optimal ones.

$\epsilon$ is a function of the number of frames the agent has seen. For the first 50,000 frames the agent only explores ($\epsilon=1$). Over the following 1 million frames, $\epsilon$ is linearly decreased to 0.1, meaning that the agent exploits more and more. DeepMind then keeps $\epsilon=0.1$; we instead chose to keep decreasing it to $\epsilon=0.01$ over the remaining 24 million frames, as suggested by the [OpenAI Baselines for DQN](https://openai.com/research/openai-baselines-dqn) (in the plot the maximum number of frames is 2 million for demonstration purposes).

![](./pictures/epsilon.png)
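
The schedule described above can also be written as a pure function of the frame count. This is a minimal sketch using the numbers from the text; the function name `epsilon_at` and its keyword names are ours, and the training loop in this notebook decrements `epsilon` step by step instead of calling such a function.

```python
def epsilon_at(frame,
               random_frames=50_000,
               greedy_frames=1_000_000,
               final_frames=24_000_000,
               eps_max=1.0, eps_min=0.1, eps_final=0.01):
    """Piecewise-linear epsilon schedule as a function of the frame count."""
    if frame <= random_frames:
        return eps_max  # pure exploration
    t = frame - random_frames
    if t <= greedy_frames:
        # First decay phase: eps_max -> eps_min
        return eps_max - (eps_max - eps_min) * t / greedy_frames
    t -= greedy_frames
    # Second decay phase: eps_min -> eps_final, then constant
    return max(eps_final, eps_min - (eps_min - eps_final) * t / final_frames)

for f in (0, 50_000, 550_000, 1_050_000, 25_050_000):
    print(f, round(epsilon_at(f), 4))
```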

In [None]:
# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
epsilon = 1.0  # Epsilon greedy parameter
epsilon_min = 0.1  # Epsilon reached at the end of the first decay phase
epsilon_final = 0.01  # Final (minimum) epsilon reached at the end of the second decay phase
epsilon_max = 1.0  # Maximum epsilon greedy parameter
epsilon_interval = (
    epsilon_max - epsilon_min
)  # Total decrease of epsilon during the first decay phase (1.0 -> 0.1)
epsilon_interval_2 = (
    epsilon_min - epsilon_final
)  # Total decrease of epsilon during the second decay phase (0.1 -> 0.01)
epsilon_random_frames = 50000.0   # Number of frames with epsilon set to 1.0
epsilon_greedy_frames = 1000000.0 # Number of frames to linearly decay epsilon from 1 to 0.1
epsilon_final_frames = 24000000.0 # Number of frames to linearly decay epsilon from 0.1 to 0.01

## Atari wrappers

### NoopResetEnv 
This wrapper performs a random number of "no-op" (no-operation) actions at the start of each episode, so that episodes begin from slightly different initial states.
### FireResetEnv 
This wrapper automatically presses the “FIRE” button at the start of each episode, which is required for some Atari games to start.
### EpisodicLifeEnv
This wrapper ends the episode whenever the agent loses a life, rather than only when the game is over, so that the agent learns to survive for longer periods.
### MaxAndSkipEnv
This wrapper repeats the chosen action for a fixed number of frames (usually 4) and returns the pixel-wise maximum over the most recent of those frames, to reduce the impact of visual artifacts such as flickering and to help the agent track moving objects.
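
As a rough illustration of the max step, assume two consecutive raw frames in which a sprite is drawn in only one of them. The arrays below are toy data, not output of the wrapper, and the variable names are ours.

```python
import numpy as np

# Two toy 4x4 "frames": the sprite (value 255) is visible only in the first one
frame_a = np.zeros((4, 4), dtype=np.uint8)
frame_a[1, 2] = 255
frame_b = np.zeros((4, 4), dtype=np.uint8)

# The pixel-wise maximum over the two frames keeps the sprite visible
max_frame = np.maximum(frame_a, frame_b)
print(max_frame[1, 2])
```
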
### ClipRewardEnv
This wrapper clips the reward signal to -1, 0, or 1 (its sign), so that the scale of the rewards, and therefore of the error signal, stays the same regardless of how many points each event is worth.
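
A minimal sketch of the clipping, assuming it is done by taking the sign of the reward; the raw reward values below are illustrative only.

```python
import numpy as np

raw_rewards = np.array([0.0, 1.0, 4.0, 7.0, -3.0])  # illustrative values only
clipped = np.sign(raw_rewards)  # every entry becomes -1.0, 0.0 or 1.0
print(clipped)
```
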
### WarpFrame
This wrapper resizes and converts the game screen frames to grayscale to reduce the input size and make it easier for the agent to learn.
We modified this code to make it more suitable for Breakout specifically. In particular, instead of resizing the image directly from 210x160 to 84x84, we first crop it to 160x160 (removing the upper part, which shows the current score and the remaining lives), and then apply the resizing.

Original Image:
![](./pictures/frame_00_delay-0.02s.png)

Cropped Image:
![](./pictures/cropped.png)

Greyscale Image (after resizing):
![](./pictures/grey.png)
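
The crop, greyscale and resize steps can be sketched with NumPy alone. This is an illustration under loud assumptions, not the code in `breakout_wrapper`: the function name `warp_breakout_frame` is hypothetical, the crop is assumed to drop the top 50 rows, greyscale is a plain channel average, and the resize is nearest-neighbour sampling (a real implementation would typically use an image library for the last two steps).

```python
import numpy as np

def warp_breakout_frame(rgb, out_size=84, crop_top=50):
    """Crop, greyscale and downsample one 210x160x3 frame (illustrative only)."""
    cropped = rgb[crop_top:, :, :]   # 210x160 -> 160x160 (assumed crop offset)
    gray = cropped.mean(axis=2)      # simple channel average as greyscale
    # Nearest-neighbour resize: pick evenly spaced source rows and columns
    rows = np.arange(out_size) * gray.shape[0] // out_size
    cols = np.arange(out_size) * gray.shape[1] // out_size
    small = gray[np.ix_(rows, cols)]
    return small.astype(np.uint8)[:, :, None]  # shape (84, 84, 1)

frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
warped = warp_breakout_frame(frame)
print(warped.shape)
```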

### ScaledFloatFrame
This wrapper scales the pixel values to be between 0 and 1 to make the input data more compatible with deep learning models. It's a sort of "brightness normalization"
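
A small sketch of the scaling, assuming 8-bit input frames; the variable names are ours.

```python
import numpy as np

obs_uint8 = np.array([[0, 128, 255]], dtype=np.uint8)
obs_float = obs_uint8.astype(np.float32) / 255.0  # values now lie in [0, 1]

# float32 needs 4 bytes per pixel instead of 1
bytes_ratio = obs_float.itemsize // obs_uint8.itemsize
print(obs_float.min(), obs_float.max(), bytes_ratio)
```

Note the trade-off: scaled frames take four times the memory of the raw 8-bit frames, which matters when many of them are kept in a replay memory.
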
### make_atari
This function creates an Atari environment with various settings suitable for deep reinforcement learning research, including the use of a `NoFrameskip` environment variant and a maximum number of steps per episode.
### wrap_deepmind
This function applies a combination of the defined wrappers to the given env object, including EpisodicLifeEnv, FireResetEnv, WarpFrame, ClipRewardEnv, and FrameStack. The scale argument can be used to include the ScaledFloatFrame wrapper as well.
### FrameStack
This wrapper stacks a fixed number of frames together to give the agent some temporal information and make it easier for the agent to learn the dynamics of the game.

Given this image, can you tell where the ball is going? 

![](./pictures/frame_00_delay-0.02s.png)

Of course not. What about this sequence instead?

![](./pictures/frame_00_delay-0.02s.png) ![](./pictures/frame_01_delay-0.02s.png) ![](./pictures/frame_02_delay-0.02s.png) ![](./pictures/frame_03_delay-0.02s.png)


We will see the results at the end...
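
Frame stacking itself is simple to sketch: keep the last four frames in a fixed-length queue and concatenate them along the channel axis. The class below is a hypothetical stand-alone illustration (the name `FrameStacker` is ours) and not the `FrameStack` wrapper used in this notebook.

```python
from collections import deque

import numpy as np

class FrameStacker:
    """Keeps the k most recent frames and returns them stacked on the last axis."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame of the episode
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, new_frame):
        self.frames.append(new_frame)  # the oldest frame is dropped automatically
        return self.observation()

    def observation(self):
        return np.concatenate(list(self.frames), axis=-1)

stacker = FrameStacker(k=4)
obs = stacker.reset(np.zeros((84, 84, 1), dtype=np.uint8))
obs = stacker.step(np.ones((84, 84, 1), dtype=np.uint8))
print(obs.shape)
```

The resulting observation has shape (84, 84, 4), which matches the input shape of both networks above.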


In [None]:
# Use the Baselines-style Atari environment for its DeepMind helper functions
env = make_atari_breakout("BreakoutNoFrameskip-v4")
# Warp the frames, convert to grey scale, stack four frames and scale the pixel values
env = wrap(env, frame_stack=True, scale=True)
env.seed(seed)

### Replay memory and other parameters

Experiences are stored in a replay buffer and the model is periodically trained on batches sampled from it. As in the DeepMind paper, we cap the size of the memory, but at a smaller value since we have less computational power. A replay buffer helps training because consecutive experiences are highly correlated in time, so training on them directly can lead to instability and slow learning. Without a replay buffer, an agent might overfit to recent experiences and fail to generalize well.
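
For illustration, a bounded replay memory can be sketched as a ring buffer that overwrites its oldest entry once full. This is an alternative to the parallel Python lists used in the next cell, not the notebook's implementation; the class name `ReplayBuffer` and its methods are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size ring buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.pos = 0  # index of the slot to write next

    def add(self, transition):
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition  # overwrite the oldest entry
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling with replacement
        return [random.choice(self.data) for _ in range(batch_size)]

    def __len__(self):
        return len(self.data)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add((f"s{t}", 0, 0.0, f"s{t + 1}", False))
stored_states = sorted(tr[0] for tr in buffer.data)
print(len(buffer), stored_states)
```

Only the three most recent transitions survive, and each insertion costs constant time regardless of the buffer size.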

In [None]:
# Experience replay buffers
action_history = []
state_history = []
state_next_history = []
rewards_history = []
done_history = []
episode_reward_history = []
batch_size = 32  # Size of batch taken from replay buffer
max_steps_per_episode = 10000 
# Maximum replay length
# Note: The Deepmind paper suggests 1000000 however this causes memory issues
max_memory_length = 100000
# Train the model after 4 actions
update_after_actions = 4
# How often to update the target network
update_target_network = 10000
running_reward = 0
episode_count = 0
frame_count = 0
terminal_life_lost = False

## Huber Loss

One other interesting thing to notice: DeepMind uses the quadratic cost function with error clipping (see page 7 of [Mnih et al. 2015](https://www.nature.com/articles/nature14236/)).

>We also found it helpful to clip the error term from the update [...] to be between -1 and 1. Because the absolute value loss function |x| has a derivative of -1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between -1 and 1 corresponds to using an absolute value loss function for errors outside of the (-1,1) interval. This form of error clipping further improved the stability of the algorithm.

Why does this improve the stability of the algorithm?

>In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values. [Source](https://machinelearningmastery.com/exploding-gradients-in-neural-networks/)

This so-called exploding gradient problem can, to some extent, be avoided by clipping the gradients to a certain threshold value if they exceed it: *if the true gradient is larger than a critical value $x$, just assume it is $x$.* Observe that the derivative of the green curve does not increase (or decrease) for $x>1$ (or $x<-1$).
Error clipping can be easily implemented in TensorFlow by using the Huber loss function: `tf.losses.huber_loss` in TensorFlow 1.x, or `keras.losses.Huber` as used in this notebook.

![](pictures/huber.png)
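
To make the link between the Huber loss and error clipping concrete, here is a NumPy sketch of the loss and its derivative with threshold `delta = 1.0` (the default of `keras.losses.Huber`). The function names `huber` and `huber_grad` are ours.

```python
import numpy as np

def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond that."""
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

def huber_grad(error, delta=1.0):
    """Derivative of the Huber loss: the error itself, clipped to [-delta, delta]."""
    return np.clip(error, -delta, delta)

errors = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
losses = huber(errors)
grads = huber_grad(errors)
print(losses)
print(grads)
```

The derivative never leaves the interval [-1, 1], which is exactly the clipping of the error term described in the quote above.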


In [None]:
# Using huber loss for stability
loss_function = keras.losses.Huber()
# The DeepMind paper uses RMSProp; however, the Adam optimizer
# improves training time
optimizer = keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)

In [None]:
# The first model makes the predictions for Q-values which are used to
# choose an action.
model = create_q_model()
# Build a target model for the prediction of future rewards.
# The weights of a target model get updated every 10000 steps thus when the
# loss between the Q-values is calculated the target Q-value is stable.
model_target = create_q_model()

In [None]:
csv_filename = "training_stats.csv"
# Check if the CSV file exists
if os.path.exists(csv_filename):
    with open(csv_filename, mode='r') as file:
        # CSV file already exists, read the header
        reader = csv.reader(file)
        header = next(reader)
else:
    # CSV file does not exist, create and write the header
    header = ["Episode", "Total Reward", "Epsilon", "Avg Reward (Last 100)", "Total Frames",
              "Frame Rate", "Model Updates", "Running Reward", "Training Time"]
    with open(csv_filename, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(header)


## Train


At every time step, the agent chooses an action according to the current $\epsilon$, takes a step in the environment and stores the transition. Every fourth frame, a random batch of 32 transitions is sampled from the replay memory and used to train the neural network.

For every training item $(s, a, r, s')$ in the mini-batch, the network is given a state $s$ (a stack of 4 frames). Using the next state $s'$ and the Bellman equation, we get the targets for our neural network. If the next state is terminal, meaning the episode has ended (or, in our implementation, a life was lost), the target is just the immediate reward. Otherwise, the state-action pair should map to the immediate reward plus the discount factor multiplied by the value of the next state's highest-valued action. The `done_sample` array implements this switch: it is 1.0 for terminal transitions and 0.0 otherwise.

Traditionally, the value of the next state's highest-valued action is obtained by running the next state through the neural network we are trying to train, but this can lead to oscillations and divergence of the policy. Instead, we use a target network, whose weights are copied from the training network every 10 000 time steps.

![](./pictures/bellman.png)
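
As a small worked example of the target computation $y = r + (1 - d)\,\gamma \max_{a'} Q_{\text{target}}(s', a')$, the sketch below uses toy numbers in place of real target-network outputs; all values and variable names are illustrative.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])          # 1.0 marks a terminal transition
next_q = np.array([[0.5, 2.0, 1.0, 0.0],   # toy target-network outputs for s'
                   [0.1, 0.2, 0.3, 0.4],
                   [9.0, 9.0, 9.0, 9.0]])

# Terminal transitions keep only the immediate reward
targets = rewards + (1.0 - dones) * gamma * next_q.max(axis=1)
print(targets)
```

For the third transition the large Q-values of the next state are ignored, because the `dones` mask zeroes out the bootstrap term.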


In [None]:
starting = datetime.datetime.now()
while True:  # Run until solved
    start_time = time.time()
    state = np.array(env.reset())

    current_lives = 5
    
    episode_reward = 0

    for timestep in range(1, max_steps_per_episode):
        # env.render(); Adding this line would show the attempts
        # of the agent in a pop up window.
        frame_count += 1

        # Use epsilon-greedy for exploration
        if frame_count < epsilon_random_frames or epsilon > np.random.rand(1)[0]:
            # Take random action
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values
            # From environment state
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_probs = model(state_tensor, training=False)
            # Take the best action
            action = tf.argmax(action_probs[0]).numpy()

        if frame_count > epsilon_random_frames: # Decay epsilon only after exploring for first 50k frames
            if epsilon > epsilon_min:
                # Decay probability of taking random action
                epsilon -= epsilon_interval / epsilon_greedy_frames
                epsilon = max(epsilon, epsilon_min)
            else:
                # Continue decaying epsilon linearly over the remaining frames
                epsilon -= epsilon_interval_2 / (epsilon_final_frames)
                epsilon = max(epsilon, epsilon_final)


        # Apply the sampled action in our environment
        state_next, reward, done, info = env.step(action)
        state_next = np.array(state_next)
            
        episode_reward += reward

        # When a life is lost, we save terminal_life_lost = True in the replay memory
        # N.B. We don't modify directly done, since done is already used to break the loop
        num_lives = info['lives']

        if (num_lives < current_lives):
            terminal_life_lost = True
            current_lives = num_lives
        else:
            terminal_life_lost = False

        # Save actions and states in replay buffer
        action_history.append(action)
        state_history.append(state)
        state_next_history.append(state_next)
        done_history.append(terminal_life_lost if not done else done) # If the game is not terminated, if life lost add true, else add done (False or true)
        rewards_history.append(reward)
        state = state_next

        # Update every fourth frame and once batch size is over 32
        if frame_count % update_after_actions == 0 and len(done_history) > batch_size:

            # Get indices of samples for replay buffers
            indices = np.random.choice(range(len(done_history)), size=batch_size)

            # Using list comprehension to sample from replay buffer
            state_sample = np.array([state_history[i] for i in indices])
            state_next_sample = np.array([state_next_history[i] for i in indices])
            rewards_sample = [rewards_history[i] for i in indices]
            action_sample = [action_history[i] for i in indices]
            done_sample = tf.convert_to_tensor(
                [float(done_history[i]) for i in indices]
            ) # turns True into 1.0 and False into 0.0.

            # Build the updated Q-values for the sampled future states
            # Use the target model for stability
            future_rewards = model_target.predict(state_next_sample)
            # Q value = reward + discount factor * expected future reward
            # updated_q_values = rewards_sample + gamma * tf.reduce_max(
            #    future_rewards, axis=1
            # )

            # Our Implementation
            # If the game is over because the agent lost or won, there is no next state and the value is simply the reward 

            updated_q_values = rewards_sample + (1 - done_sample) * gamma * tf.reduce_max(future_rewards, axis=1)

            # Create a mask so we only calculate loss on the updated Q-values (If action taken was 1, it create [0,1,0,0])
            masks = tf.one_hot(action_sample, num_actions)

            with tf.GradientTape() as tape:
                # Train the model on the states and updated Q-values
                q_values = model(state_sample)

                # Apply the masks to the Q-values to get the Q-value for action taken
                q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
                # Calculate loss between new Q-value and old Q-value
                loss = loss_function(updated_q_values, q_action)

            # Backpropagation
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if frame_count % update_target_network == 0:
            # Update the target network with new weights
            model_target.set_weights(model.get_weights())
            # Log details
            template = "running reward: {:.2f} at episode {}, frame count {}"
            print(template.format(running_reward, episode_count, frame_count))

        # Limit the state and reward history
        if len(rewards_history) > max_memory_length:
            del rewards_history[:1]
            del state_history[:1]
            del state_next_history[:1]
            del action_history[:1]
            del done_history[:1]

        if done:
            # print(info)
            break

    # Update running reward to check condition for solving
    episode_reward_history.append(episode_reward)
    if len(episode_reward_history) > 100:
        del episode_reward_history[:1]
    running_reward = np.mean(episode_reward_history)

    # Calculate additional statistics
    avg_reward_last_100 = np.mean(episode_reward_history[-100:])
    training_time = time.time() - start_time  # Duration of this episode in seconds
    frame_rate = timestep / training_time  # Frames per second within this episode

    # Append the episode statistics to the CSV file
    with open(csv_filename, mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([episode_count, episode_reward, epsilon, avg_reward_last_100,
                            frame_count, frame_rate, len(done_history),
                            running_reward, training_time])
    
    if (episode_count%100 == 0):
        current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    
        print(f"{current_time} - Episode {episode_count} reached. Saving model in saved_models/model_episode_{episode_count}. . .")
        model.save("saved_models/model_episode_{}".format(episode_count))
        current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{current_time} - Model saved.")
        current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{current_time} - Saving target model. . .")
        # Save the target model
        model_target.save("saved_models/target_model_episode_{}".format(episode_count))
        current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{current_time} - Target model saved in saved_models/target_model_episode_{episode_count}.")

    episode_count += 1
    if (num_lives==0):
        template = "running reward: {:.2f} at episode {}, frame count {}"
        print(template.format(running_reward, episode_count, frame_count))

    if running_reward > 40:  # 40 is the avg score of human beings
        print("Solved at episode {}!".format(episode_count))
        episode_count -= 1
        break

# Evaluation and Performance Comparison



## Training Comparison

### Training Process: Dense vs Convolutional

<div style="display:flex; justify-content:center;">
    <img src="./reward_plot_training_feed.png" style="width:50%; height:auto; margin-right:20px;">
    <img src="./reward_plot_training.png" style="width:50%; height:auto;">
</div>

### Some interesting insight: Using terminal life lost

<div style="display:flex; justify-content:center;">
    <img src="./reward_plot_wothout life_loss_terminal.png" style="width:50%; height:auto; margin-right:20px;">
    <img src="./reward_plot_with_life_loss_terminal.png" style="width:50%; height:auto;">
</div>

### Why stacking frames?
<div style="display:flex; justify-content:center;">
    <img src="./reward_plot_dual.png" style="width:50%; height:auto;">
</div>

# Evaluation Process

In [None]:
import imageio
import numpy as np
import cv2
import tensorflow as tf
import gym
import os
import pandas as pd
from breakout_wrapper import make_atari_breakout, wrap
import time
import gc

# Configuration parameters
seed = 42
num_actions = 4

# Use the Baseline Atari environment for testing
env = make_atari_breakout("BreakoutNoFrameskip-v4")
# Warp the frames, grey scale, stack four frames, and scale to a smaller ratio
env = wrap(env, frame_stack=True, scale=True, clip_rewards=False, episode_life=False)
env.seed(seed)

# Get the current working directory
current_directory = os.getcwd()

episode_count = 0
max_episodes = 10  # Set the desired number of episodes

# Load training stats CSV
training_stats_file = "training_stats.csv"  # Replace with the actual filename
training_stats_df = pd.read_csv(training_stats_file)

# Path to the saved model
model_filename = "saved_models/model_episode_{}".format(episode_count)

# Create the absolute path to the model file
absolute_model_path = os.path.join(current_directory, model_filename)

# Load the pre-trained model
loaded_model = tf.keras.models.load_model(absolute_model_path)

# Function to generate GIF from raw frames
def generate_gif(frame_number, frames_for_gif, reward, path, ep):
    imageio.mimsave(f'{path}{"ATARI_Breakout_Eval_model_{0}_reward_{1}.gif".format(ep, int(reward))}',
                    frames_for_gif, duration=1/30)
    print(f'Gif saved at {path}{"ATARI_Breakout_Eval_model_{0}_reward_{1}.gif".format(ep, int(reward))}')

# Function to choose the greedy action from the model's predictions (no exploration during evaluation)
def choose_action(model, state):
    # Exploit: choose the action with the highest predicted value
    state_tensor = tf.convert_to_tensor(state)
    state_tensor = tf.expand_dims(state_tensor, 0)
    action_probs = model(state_tensor, training=False)
    # Take the best action
    action = tf.argmax(action_probs[0]).numpy()
    return action


for episode_count in range(0, training_stats_df['Episode'].max() + 1, 100):
    print(f'Testing episode {episode_count}. . .')
    # Path to the saved model
    model_filename = "saved_models/model_episode_{}".format(episode_count)

    # Create the absolute path to the model file
    absolute_model_path = os.path.join(current_directory, model_filename)

    # Load the pre-trained model
    loaded_model = tf.keras.models.load_model(absolute_model_path)

    # Initialize variables
    highest_reward = float('-inf')  # Variable to track the highest reward
    frames_highest_reward = []  # List to store frames associated with the highest reward

    # Test the model in the environment
    frames_for_gif = []  # List to store raw RGB frames for GIF generation
    current_lives = 5
    restart = True
    episode_reward = 0
    episode_counter = 0

    # Lists to store Q values during the episode
    q_values = []

    # Lists to store rewards for calculating the average over 10 episodes
    rewards_for_average = []

    frame_count = 0

    time.sleep(2)

    while episode_counter < max_episodes:
        frame_count += 1
        if restart:
            restart = False
            state = np.array(env.reset())
            state_next, reward, done, info = env.step(1)  # Play Fire Action
            state = np.array(state_next)

        # Capture the raw RGB frame for GIF generation
        raw_frame = env.render(mode='rgb_array')
        frames_for_gif.append(raw_frame)

        # Comment this and uncomment the other to play yourself
        action = choose_action(loaded_model, state)

        state_next, reward, done, info = env.step(action)
        state = np.array(state_next)

        num_lives = info['lives']
        print(info)
        print(done)
        print(num_lives)

        # Record the highest Q-value the model predicts for the new state
        q_value = np.max(loaded_model.predict(np.expand_dims(state, axis=0)))
        q_values.append(q_value)

        if num_lives < current_lives:
            state_next, reward, done, info = env.step(1)
            state = np.array(state_next)
            current_lives = num_lives
            print(num_lives)
        episode_reward += reward

        if done:
            print("Episode n " + str(episode_counter))
            # Store the maximum reward and frames with the highest rewards
            if episode_reward > highest_reward:
                highest_reward = episode_reward
                frames_highest_reward = frames_for_gif.copy()

            # Print the average of Q values
            avg_q_value = np.mean(q_values)
            print(f"Average Q Value: {avg_q_value}")

            # Store rewards for calculating the average over 10 episodes
            rewards_for_average.append(episode_reward)

            # Get the corresponding frame count from training stats
            episode_row = training_stats_df[training_stats_df['Episode'] == episode_count]
            if not episode_row.empty:
                frame_count_training = episode_row['Total Frames'].values[0]
                # Store episode_count, frame_count, avg_reward, max_reward, avg_q_value in your evaluation CSV

            # Reset variables for the next episode
            # (the environment itself is reset at the top of the loop via `restart`)
            frames_for_gif = []
            episode_reward = 0
            current_lives = 5
            restart = True
            episode_counter += 1

    # Generate GIF for the episode with the highest reward
    generate_gif(0, frames_highest_reward, highest_reward, 'recordings/', episode_count)

    # Print the average reward over 10 episodes
    avg_reward_over_10_episodes = np.mean(rewards_for_average)    
    print("Individual Rewards for Each Episode:", rewards_for_average)
    print(f"Average Reward Over 10 Episodes: {avg_reward_over_10_episodes}")
    # Store episode_count, frame_count, avg_reward, max_reward, avg_q_value in your evaluation CSV
    evaluation_data = {
        'episode_count': episode_count,
        'frame_count_training': frame_count_training,
        'avg_reward': avg_reward_over_10_episodes,
        'max_reward': highest_reward,
        'avg_q_value': avg_q_value
    }
    # Append the evaluation data to your CSV file
    with open('evaluation_stats.csv', 'a') as f:
        pd.DataFrame([evaluation_data]).to_csv(f, header=f.tell()==0, index=False)  # Append without header if the file is not empty

    print(f'Testing episode {episode_count} finished. . .')
    del loaded_model
    time.sleep(2)

## Evaluation Results

### Reward (score)
<div style="display:flex; justify-content:center;">
    <img src="./reward_plot_evaluation_reward_feed.png" style="width:50%; height:auto; margin-right:20px;">
    <img src="./reward_plot_evaluation_reward.png" style="width:50%; height:auto;">
</div>

### Q-values
<div style="display:flex; justify-content:center;">
    <img src="./reward_plot_evaluation_q_value_feed.png" style="width:50%; height:auto; margin-right:20px;">
    <img src="./reward_plot_evaluation_q_values.png" style="width:50%; height:auto;">
</div>
