# HW \#1: Deep Q Networks: VFA for Q-Learning

**Name:**  <font color="red">Alexander Johnston</font>



# I. Overview

## Objective/Approach
The objective of this assignment is to train a Reinforcement Learning model to defeat the beginner Pong AI from the Gymnasium library. I implemented a simple DQN model using the PyTorch library to achieve this. GIFs of the model at different stages of training are saved in model_gifs.

Files in this assignment:

assign2 - this file, contains last run

model_gifs - I was unable to make a separate model observer, so I incorporated the GIF creation into this assign2 file. The resulting GIFs have been saved in model_gifs.



Also, note that my program did not error out; I ended the run early by accident.

## Explanation of the pong problem

The Pong problem consists of a simple 2-player game where:
- The left paddle is controlled by the environment's AI, whose behavior depends on the difficulty level.
- The right paddle is controlled by the Reinforcement Learning agent.
- The victory condition is to score 21 points against the opponent.
- Points are scored when you get the ball past the opponent's paddle.


Environmental factors:
Steps are roughly equivalent to frames, since in Atari games a player's action is registered on the frame in which it is taken.
Therefore, each step (or in my case each group of 4 steps) receives 1 action from each player, which is reflected in the next frame.

Episodes are equivalent to 1 game, and an episode's reward is the summation of all rewards gained during it. The episode ends when 1 side reaches 21 points. To interpret the score of an episode: a negative reward indicates the model lost, with a larger magnitude indicating how badly it lost, and vice versa for a positive score/victory.

The observation is a 210 by 160 pixel image with 3 color channels, which is the only output of the environment that the model is able to observe. The model has an action space consisting of 6 possible actions, including up and down, which are by far the most important.

To emulate reward, the Model is fed a reward of +1 when it scores, and -1 when the opponent scores, incentivising actions that lead to a victory, and discouraging defeat.
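
As a small illustration of how the per-point rewards add up to an episode score, here is a minimal sketch. The helper name `episode_return` is hypothetical and is not part of the assignment code; it only assumes the +1/-1 reward scheme described above.

```python
# Hypothetical helper, not part of the assignment code.
def episode_return(point_rewards):
    """Sum the per-point rewards (+1 when the agent scores, -1 when the opponent scores)."""
    return sum(point_rewards)

# Example: the agent wins 21-15, so there are 21 rewards of +1 and 15 of -1.
rewards = [+1] * 21 + [-1] * 15
total = episode_return(rewards)
print(total)        # 6
print(total > 0)    # True, a positive return means the agent won
```

A shutout loss would give -21 and a shutout win +21, which are the two extremes of the episode reward.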


To defeat pong, one should train a model on which actions result in the highest rewards based on the current state of the game, and the model must come to understand how to maximize long term rewards to optimize scoring and resisting scores on itself.



# II. Problem


## Pong

Pong is one of the simplest Atari games, as shown in the figure below. The goal is to win the ping-pong/tennis-like game by being the first to score 21. A player scores one point when the opponent hits the ball out of bounds or misses a hit. The right paddle is controlled by your RL agent and the left paddle is controlled by a computer. 

![pong](https://ale.farama.org/_static/videos/environments/pong.gif)

#### STEPS for Pong

1. [II Problems] First, import gymnasium (if you haven't installed it, make sure to install gymnasium first).
1. [II Problems] Initialize, learn and test how the environment works.
1. [II Problems] Explain the environment code.
3. [III Methods] Build your own Deep Q Network (DQN). 
4. [III Methods] Explain your RL agent (DQN) with review of VFA and how it is implemented.
5. [IV Results]  Discuss your hyperparameter search process. 
5. [IV Results]  Explain your final setup and discuss the agent's performance. 




## Explain Environment code

For the environment, I had to import the usual suspects (numpy, matplotlib, random, gym), along with supporting libraries such as ale_py (the Atari Python library), as well as gymnasium's wrapper classes.

Regarding the variables, I set a few variables for different training options (easy, normal, hard) that correspond to the difficulty number gymnasium associates them with. For this assignment, I only did easy as my GPU was having overheating issues.

To create the environment, I ran the following:
env = gym.make("ALE/Pong-v5", render_mode="rgb_array", difficulty=easy)
env = ResizeObservation(env, (64, 64))
env = GrayscaleObservation(env, keep_dim=True)

The gym.make() call and its inputs simply create the environment, similar to the last assignment.
The first real thing of interest was my resizing of the observation and its subsequent grayscaling. I implemented these out of concern for the amount of memory used by my program, and this usage was heavily inspired by a YouTube video (https://www.youtube.com/watch?v=vaVBd9H2eHE) I watched that explains these wrappers as well. They primarily came in handy for limiting the amount of memory I preallocated, allowing my older GPU to handle more.

As for choosing 64x64, I considered 84x84 as those were the values used in the DQN paper I read through, but I went with 64x64 as it seemed to be sufficient in the examples I saw online. In addition, I believed it would speed up training and reduce memory usage, which matters because I am training on an older GPU. I saw a rule of thumb stating that if you think you could still play the game after it is downsampled to a grid of that size, the resolution is sufficient, and I agreed.

One thing I thought was interesting was the grayscale wrapper, with which I was able to reduce the number of channels from 3 to 1, but I still wish to do some more research on how this affected the training.
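
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes one byte (uint8) per pixel value and a replay buffer of 1,000,000 transitions (the max_size used later in this notebook); the helper name `frame_bytes` is made up for illustration.

```python
# Back-of-the-envelope sizes, assuming one byte (uint8) per pixel value.
def frame_bytes(height, width, channels):
    return height * width * channels

raw = frame_bytes(210, 160, 3)    # original RGB Pong frame
small = frame_bytes(64, 64, 1)    # after resize + grayscale

print(raw, small)                 # 100800 4096
print(raw / small)                # 24.609375, so each frame is about 24.6x smaller

# A replay buffer keeps a state and a next state per transition.
transitions = 1_000_000
print(2 * small * transitions / 1e9)   # 8.192 (GB) for the 64x64 grayscale frames
print(2 * raw * transitions / 1e9)     # 201.6 (GB) if raw frames were stored instead
```

Under these assumptions, the resize and grayscale steps are what make a buffer of this size plausible on ordinary RAM at all.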

Later on, similar to the last assignment, I used env.reset() to reset the Pong environment along with the other env commands, but I will not go over them in heavy detail considering this was covered in the last assignment. I typically used the general env methods to gain information about the environment, such as action_dimensions=env.action_space.n to acquire basic information about actions. 

After initialization, to interact between gymnasium and PyTorch, I would get the obs from env.reset() and convert it to a PyTorch tensor. This came with the related issue that PyTorch expects the channel dimension first, while gymnasium puts it last, so throughout my code I would either use something like this:

obs, info = env.reset()
obs = obs.transpose(2, 0, 1)

Or, I would use a slightly modified version of the method shown in the YouTube video (https://www.youtube.com/watch?v=vaVBd9H2eHE):
def proc_obs(obs):
    obs = torch.tensor(obs,dtype=torch.float32).permute(2,0,1)
    return obs

This information would then be fed into my networks, and eventually an action would be made, and the environment would be informed.

The last interaction with gymnasium to cover is rewards, which simply would award/penalize based on the assertions made in my pong explanation earlier.

## ENVIRONMENT CODE BELOW

In [1]:
# Add your code for setting up the environment
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as Green_FN
import numpy as np
import random
import time
import ale_py as ale
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, clear_output
from gymnasium.wrappers import ResizeObservation, GrayscaleObservation
easy = 1
normal = 2
hard = 3

env = gym.make("ALE/Pong-v5", render_mode="rgb_array", difficulty=easy)
env = ResizeObservation(env, (64, 64))
env = GrayscaleObservation(env, keep_dim=True)


A.L.E: Arcade Learning Environment (version 0.10.2+c9d4b19)
[Powered by Stella]





# III. Methods

## Describe your neural network function approximator (how many hidden units? why?).

My neural network function approximator was made up of the following:
- 3 Convolutional Layers
- 3 Linear nn Layers
- An additional Linear layer that maps the neurons to the 6 actions.

Explaining Hidden units/parameters:

The convolutional layers use an increasing number of channels. The input has 1 channel because grayscaling leaves a single color channel, and the convolutional layers then scale this up to 32 (1 to 8 to 16 to 32). 3 convolutional layers seemed to be enough for both 84x84 and 64x64, as the DQN paper I read, along with 2 GitHub repositories and the YouTube videos, all used similar setups. I went with the number of kernels and the stride used in the YouTube video as my knowledge of these values is limited, but to my understanding the kernel size is the looking glass, while the stride is how far the looking glass moves, so leaving the stride as half of the kernel size seemed apt.

The linear layers, contrary to the code below (which states a default of 256), are later given the parameter "hidden_layers = 128".
The DeepMind paper used 512 hidden units, and the video I watched used 256, but I felt that due to my lack of memory space, I should use 128 even if it might have more difficulty learning. In addition, Pong is a simple game, so I did not think that using fewer hidden units than other experiments would be too detrimental to my success.

Lastly, the final_layer simply connects the final layer of neurons to the 6 actions. I considered making only 2 actions selectable as I don't understand the need for the other actions, but I left it be as it seemed out of scope at this time.

One thing of note in my forward method, was the h = h.view(h.size(0), -1).
This simply flattened the non batch_size dimensions into a single vector for the fully connected layers.

For the calculate_convolutional_output, it was used to measure what the input for the initial fc layer should be.
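
The same number can be worked out by hand. The sketch below is plain arithmetic rather than the notebook's method (which runs a dummy tensor through the conv layers); it assumes a 64x64 input, no padding, no dilation, and the kernel_size=4, stride=2 settings from the class below. The helper name `conv_out` is made up for illustration.

```python
# Output size of an unpadded, undilated convolution along one spatial dimension.
def conv_out(size, kernel_size, stride):
    return (size - kernel_size) // stride + 1

side = 64
for _ in range(3):              # three conv layers, each kernel_size=4, stride=2
    side = conv_out(side, kernel_size=4, stride=2)
    print(side)                 # 31, then 14, then 6

out_channels = 32               # channels after the third conv layer
flat_features = out_channels * side * side
print(flat_features)            # 1152
```

So, under these assumptions, the first fully connected layer would receive 1152 inputs after flattening.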

One additional thing: weights_init is COPIED from "https://www.youtube.com/watch?v=vaVBd9H2eHE", as I had never used weight initialization before and wanted to incorporate it for learning. The code could be run without this initialization, but it seemed to improve early performance.

The initial weights used were Kaiming normal for the Conv2d layers and Xavier normal for the Linear layers. It was interesting comparing all-zero weights with these initializations during my experimentation.



## Explain your codes / GENERAL CODE STRUCTURE EXPLANATION

For the other sections, my explanations of the code are provided above the corresponding code cells below. EX: The buffer class and the game loop both have explanations above them. Here, I will give an explanation of the overall structure of my code.

In general, for this to work you need the following:

A buffer using normal memory, a training loop that uses the gpu, an agent/model, and an environment.

My code is divided into the following classes/sections:

Initialization (the code above that starts the env, and the code before the training loop that sets up all of my parameters)

environment
- Simply operates out of the gymnasium class, which I defined earlier. Does not have much associated code.
- Provides ongoing states, and we provide it with our action taken

py_model
- This class initializes and operates my neural networks
- It takes in an observation and operates on it

Buffer
- This class takes in past experiences and stores them in CPU memory for later access
- It primarily stores NumPy arrays, while py_model takes in tensors

game loop
- This section acts as the agent (it's basically a disgusting double loop)
- Basically has a loop over all episodes, which contains the step loop
- Sends and receives all information from the above classes and handles information transfer


## py_model Class explanation

Already explained above, below sections will have explanations in this format.

In [2]:
#FOR THIS CLASS I LEARNED HOW TO DO THIS FROM THE YOUTUBE VIDEO LINKED BELOW
#I DID NOT TAKE ML SO I HAD TO LEARN PYTORCH FROM SCRATCH
#https://www.youtube.com/watch?v=vaVBd9H2eHE
#Convolutional NN + relu maps out the features, which are then flattened into input layer
#this flattened layer is fed into a fully connected layer
#it is then outputted
class py_model(nn.Module):
    def __init__(self, action_dimension, hidden_dimensions=256, observation_shape=None):
        super(py_model, self).__init__()
        #THIS INITIALIZES ALL OF THE LAYERS
        #Convolutional Layers
        #3 Layers due to decreasing the image 
        self.convolutional_layer_1 = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=4, stride=2)

        self.convolutional_layer_2 = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=4, stride=2)

        self.convolutional_layer_3 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=4, stride=2)

        convulational_layer_output_size = self.calculate_convolutional_output(observation_shape)

        self.fully_connected_layer_1 = nn.Linear(convulational_layer_output_size, hidden_dimensions)
        self.fully_connected_layer_2 = nn.Linear(hidden_dimensions, hidden_dimensions)
        self.fully_connected_layer_3 = nn.Linear(hidden_dimensions, hidden_dimensions)
        
        self.final_layer = nn.Linear(hidden_dimensions, action_dimension)
        #weights initialization taken from video
        self.apply(self.weights_init)
    

    def forward(self, h):
        #actual running
        h = h / 255
        h = Green_FN.relu(self.convolutional_layer_1(h))
        h = Green_FN.relu(self.convolutional_layer_2(h))
        h = Green_FN.relu(self.convolutional_layer_3(h))
        h = h.view(h.size(0), -1)
        h = Green_FN.relu(self.fully_connected_layer_1(h))
        h = Green_FN.relu(self.fully_connected_layer_2(h))
        h = Green_FN.relu(self.fully_connected_layer_3(h))
        output = self.final_layer(h)
        return output
    #--------------------------------------------------------------------------------------------
    #These methods were mostly taken and modified from the previously mentioned youtube video, instead of written from scratch
    def calculate_convolutional_output(self, observation_shape):
        h = torch.zeros(1, *observation_shape)
        h = Green_FN.relu(self.convolutional_layer_1(h))
        h = Green_FN.relu(self.convolutional_layer_2(h))
        h = Green_FN.relu(self.convolutional_layer_3(h))
        return h.view(-1).shape[0]

    def weights_init(self, m):
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.Linear):
            nn.init.xavier_normal_(m.weight)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)


## Buffer Class Explanation

This buffer class, as mentioned before, stores the historical experiences of the model and keeps the GPU from being overloaded with memory.
By maintaining a structured memory of past states, rewards, outcomes, and actions, the model can go back to old data and reuse it.

def __init__(self, max_size, input_shape, device='cpu')

This method initializes the buffer and preallocates all of the memory required.
input_shape takes the shape of the grayscale image, device tells which device sampled batches are moved to, and max_size
refers to the maximum number of transitions that can be stored in the buffer.

In this method, we preallocate for 
past states, next states, actions taken, rewards, and whether an episode ended.

def determine_sampleability(self, batch_size)
- This method simply returns whether more than 10x the batch_size experiences have been stored.
- This ensures a sufficient base amount of exploration is done before we are even able to call upon our past experiences.


def store_transition(self, state, action, reward, next_state, game_complete)
This method stores an experience in the replay buffer.
It takes in state, action, reward, next_state, and game_complete,
which are what we observed during the step.

The structure flows like this:
- computes the memory index, wrapping around to overwrite the oldest experience once the buffer is full
- stores the experience at that index
- increments the memory counter


def sample_buffer(self, batch_size)
This method samples a batch of batch_size random past experiences.

It does this through the following execution steps:
- max_memory is set to the minimum of the memory counter and the memory size, so that if the buffer isn't full we only sample from the valid range
- randomly picks indices of experiences in memory
- retrieves the states, actions, rewards, next states, and completion flags stored at those indices
- moves them to the GPU and returns them for the game loop to use in training
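
The wrap-around storage and the "valid range" sampling can be seen in a toy version. This is only a sketch with plain Python lists standing in for the NumPy arrays; the class name `TinyBuffer` and its methods are made up for illustration and are not the ReplayBuffer class below.

```python
import random

# Toy stand-in for the replay buffer: plain lists instead of NumPy arrays.
class TinyBuffer:
    def __init__(self, max_size):
        self.max_size = max_size
        self.counter = 0
        self.slots = [None] * max_size

    def store(self, item):
        index = self.counter % self.max_size      # wraps back to 0 once full
        self.slots[index] = item
        self.counter += 1

    def sample(self, batch_size):
        valid = min(self.counter, self.max_size)  # skip never-written slots
        picks = [random.randrange(valid) for _ in range(batch_size)]
        return [self.slots[i] for i in picks]

buf = TinyBuffer(max_size=3)
for step in range(5):
    buf.store(step)

print(buf.slots)        # [3, 4, 2], steps 0 and 1 were overwritten
print(buf.sample(2))    # two items drawn (with replacement) from the valid slots
```

Like np.random.choice with its default settings, this toy sampler draws with replacement, so the same transition can appear twice in one batch.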

In [3]:
#---------------------REPLAY BUFFER CLASS---------------------
#I used the tutorial from-----https://www.youtube.com/watch?v=vaVBd9H2eHE
#I did not know how to code this, so I learned from this video.
#I did not copy this section, but I rewrote it after watching and learning from the video
#again, i have never done rl before so I had to start somewhere

class ReplayBuffer:
    def __init__(self, max_size, input_shape, device='cpu'):
        #MEMORY ALLOCATION TO NORMAL RAM
        self.memory_size = max_size
        self.memory_counter = 0
        #Creates memory and fills with zeros initially, multidimensional
        self.state_memory = np.zeros((self.memory_size, *input_shape), dtype=np.uint8)
        #creates next state memory and fills it with zeros, multidimensional
        self.next_state_memory = np.zeros((self.memory_size, *input_shape), dtype=np.uint8)
        #Action and reward memory, single dimensional
        self.action_memory = np.zeros(self.memory_size, dtype=np.uint8)
        self.reward_memory = np.zeros(self.memory_size, dtype=np.float32)
        #Terminal memory
        self.terminal_memory = np.zeros(self.memory_size, dtype=bool)
        #Setting method device as cpu
        self.device = device #Basically, all this stuff is stored on the normal RAM, not gpu RAM

        
    def determine_sampleability(self, batch_size):
        #Only allow sampling once more than 10 batches worth of experiences are stored
        return self.memory_counter > (batch_size * 10)

        #actions - action_memory
        #rewards - reward memory
        #next_states - next_state_memory
        #state - state_memory
        #completed_games - terminal_memory - game_complete - when game is done
    
    def store_transition(self, state, action, reward, next_state, game_complete):
        #Calculating how many things I have stored in memory
        #This rewrites the buffer after it reaches a certain point
        #It preallocates, and writes over when it get to that point
        memory_index = self.memory_counter % self.memory_size
        #Current state is where it is in memory, next state is the next area in memory past each mem_size
        self.state_memory[memory_index] = state.cpu().numpy().astype(np.uint8)
        self.next_state_memory[memory_index] = next_state.cpu().numpy().astype(np.uint8)

        #Doing the above for the action/reward/terminal_memory as well
        self.action_memory[memory_index] = action  #action is already a plain integer, no tensor conversion needed
        self.reward_memory[memory_index] = reward
        self.terminal_memory[memory_index] = game_complete

        #iterating the memory counter, we have gone to the next state stored in memory
        self.memory_counter +=1 

    def sample_buffer(self, batch_size):
        #retrieving a random batch of past experiences
        #ensures it doesnt retrieve from unallocated memory
        max_memory = min(self.memory_counter, self.memory_size)
        #this ensure we randomly choose different parts of memory
        batch = np.random.choice(max_memory, batch_size)

        #here, I pull the sampled experiences out of each memory array
        states = self.state_memory[batch]
        next_states = self.next_state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        completed_games = self.terminal_memory[batch]

        states = torch.tensor(states, dtype=torch.float32).to(self.device)
        next_states = torch.tensor(next_states, dtype=torch.float32).to(self.device)
        actions = torch.tensor(actions, dtype=torch.float32).to(self.device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device)
        completed_games = torch.tensor(completed_games, dtype=torch.bool).to(self.device)

        return states, actions, rewards, next_states, completed_games

## Explanation of Misc Methods

gradual_update provides a soft update to the target network, creating stability.

proc_obs basically prevents me from having to permute every time obs is mentioned.

Gym -> Torch has a weird interaction where gym orders the image dimensions as (height, width, channels) while torch expects (channels, height, width), so this method moves them around and converts the gym NumPy array to a torch tensor. This was made because gym observations are produced so often in the training loop.
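
To give a feel for how slowly the soft update moves the target network, here is a scalar sketch. It uses plain floats in place of network parameters and the same tau=0.005 default as gradual_update; the function name `soft_update` is made up for illustration.

```python
# Scalar version of the soft update, using plain floats instead of tensors.
def soft_update(target, source, tau=0.005):
    return target * (1.0 - tau) + source * tau

target, source = 0.0, 1.0
for _ in range(100):
    target = soft_update(target, source)

# After n updates the remaining gap is (1 - tau) ** n of the original gap,
# so 100 updates close roughly 39% of the distance to the source value.
print(round(target, 2))                      # roughly 0.39
print(round(1 - (1 - 0.005) ** 100, 2))      # same value from the closed form
```

The target therefore trails the online model by many updates, which is the stabilizing effect the soft update is meant to provide.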

In [4]:
def gradual_update(target, source, tau=0.005):
    for target_param, param in zip(target.parameters(), source.parameters()):
        target_param.data.copy_(target_param.data * (1.0 - tau) + param.data * tau)

def proc_obs(obs):
    obs = torch.tensor(obs,dtype=torch.float32).permute(2,0,1)
    return obs

## Explanation of params and various declarations

This section is mostly covered in the hyper parameters section, so I will go over what is not covered over there.

self_repeat is left over from an earlier version.
The initial env.reset() and obs.transpose are left over from some testing, although the resulting obs.shape is still used later to size the buffer and the networks.

episode_steps is tracking for one of the charts, and at one point I had it prevent episodes from using more than 10k steps,
because of an earlier code bug that I resolved.

frame_action_repeat is explained below in the loop; it is used for repeating an action for n frames, making the paddle more consistent.

device is set to cuda because I am using the GPU for training in this section.

The 4 arrays are for tracking and are briefly covered below.

print(f'Loaded model on device {device}') was for testing as a result of earlier errors with my NVIDIA drivers.

Everything else here is covered in the hyperparams explanations.

In [5]:
#Writing the actual algorithm
obs, info = env.reset()
obs = obs.transpose(2, 0, 1)
self_repeat = ""
batch_size = 64
epsilon = 1
min_epsilon = .1
epsilon_decay = .998
hidden_layers = 128
learning_rate = 0.0001
episodes = 10000
episode_steps = 0
max_episode_steps = 10000
frame_action_repeat = 4
device = 'cuda'
gamma = 0.99
print(f'Loaded model on device {device}')
episode_rewards = []
episode_steps_list = []
all_actions_taken = []
rolling_avg_rewards = []

Loaded model on device cuda


## Explanation of Training loops/code

This section is about to be extremely long due to the complexity of the model, and my inability to split it without breaking it.
Next time I am 100% making a separate agent class to allow for better and clearer execution flow.

First of all, before the loop, we have a bit more setup.
We declare the following:

- memory (the replay buffer, whose contents are stored in CPU memory)
- model (our actual model)
- target model (the target network, initialized as a copy of the model)
- optimizer (a basic Adam optimizer I saw used in DeepMind and OpenAI papers; I am not really sure what it does but thought I'd mess with it)

We also set steps_taken, which is a general tracker for the loop.


for the actual loop it begins with 
for episode in range (episodes):
    game_complete=False
    reward_from_this_episode=0

    obs, info = env.reset()
    obs = proc_obs(obs)

In this section, we declare our episode loop, which loops 1 time per episode,
and set our game_complete flag to False, which lets us know when the Pong match is over (Pong match = episode).
We also initialize the reward total for the episode to zero.

Lastly, we reset our environment and convert obs into a channel-first tensor ready for the model.

    -------------------
    Next, we enter our game loop
    
    while game_complete is False and episode_steps<max_episode_steps:
    
    This loop runs once per episode, and contains our per-step logic.
    We let this run until the episode finishes, or until we use 10k steps, a cap left over from an earlier bug in the code.

    We then use epsilon-greedy selection to determine whether we explore with a random action or exploit what the model has learned.
    We add whatever action we used to our array for one of our plots, then move on.
    
    We set our reward to zero, then begin a tertiary loop.
        for i in range(frame_action_repeat):
                t = 0
                next_obs, t, game_complete, truncated, info = env.step(action=action)
                next_obs = proc_obs(next_obs)
                reward+= t
                if (game_complete):
                    break

        In this tertiary loop, we repeat the chosen action for up to 4 frames and sum the rewards, so the model only has to pick an action once every 4 frames, reducing compute cost.

    Following this, we store our experience and increment our important counters, like steps_taken and episode_steps.
    
    memory.store_transition(obs, action, reward, next_obs, game_complete)

        obs = next_obs
        reward_from_this_episode += reward
        episode_steps +=1
        steps_taken +=1


    If enough data is present in our buffer, we then train the model based on this data.
    We do this by retrieving a batch of previous data if allowed by our determine_sampleability method.
    We then:
    - compute the Q values of the current states with the model
    - find the best action for each next state using the model
    - retrieve the Q value of that action for the next state from the target model
    - compute our Bellman target
    - take a gradient step on the MSE loss between the current Q values and the target
    
    We then do a gradual update to the target model if we are on a 4th step.
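
For a single transition, the target computation described above can be sketched with plain lists standing in for the network outputs. The Q values here are made up, and the function name `bellman_target` is hypothetical. Selecting the next action with the online model but evaluating it with the target model is the pattern usually called a Double DQN target.

```python
# Single-transition version of the target computation, with plain lists
# standing in for the network outputs (the Q values here are made up).
def bellman_target(reward, done, gamma, online_next_q, target_next_q):
    # Pick the next action with the online model...
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    # ...but evaluate that action with the target model.
    next_q = target_next_q[best_action]
    # When the game is over, (1 - done) removes the bootstrap term.
    return reward + (1 - int(done)) * gamma * next_q

online_next_q = [0.1, 0.7, 0.3]   # hypothetical model(next_state)
target_next_q = [0.2, 0.5, 0.9]   # hypothetical target_model(next_state)

print(bellman_target(0.0, False, 0.99, online_next_q, target_next_q))  # 0.99 * 0.5
print(bellman_target(-1.0, True, 0.99, online_next_q, target_next_q))  # -1.0, no bootstrap
```

Note that the first target uses 0.5 (the target model's value for the online model's chosen action) rather than 0.9, the target model's own maximum.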

The game loop then ends, and we are back in the episode loop.

We reduce our epsilon,
add values for our different performance measuring metrics,
and compute the rolling average for the matplotlib plots.

In the output box, one thing of note is that episode steps are reset.
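
Since epsilon is multiplied by the decay factor once per episode, the decay value directly sets how many episodes of heavy exploration the agent gets. The sketch below counts this for the values used in this notebook (start 1, floor .1, decay .998) and for the .99 decay I tried first; the function name `episodes_until_floor` is made up for illustration.

```python
# Count how many per-episode decays it takes for epsilon to reach its floor.
def episodes_until_floor(decay, epsilon=1.0, min_epsilon=0.1):
    episodes = 0
    while epsilon > min_epsilon:   # same check as in the training loop
        epsilon *= decay
        episodes += 1
    return episodes

print(episodes_until_floor(0.998))   # 1151
print(episodes_until_floor(0.99))    # 230
```

So a decay of .998 stretches exploration over roughly five times as many episodes as .99.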

## Explanation of Plots

For the plots, all that you really need to know is that the following arrays pick up data from various
parts of the loop, and are passed into various matplotlib plots. It's really simple and very apparent when it happens.
The plot code is centralized near the end of the episode loop, but before the early end condition is checked and the episode steps are reset.
These plots update once per episode.

episode_rewards = []

episode_steps_list = []

all_actions_taken = []

rolling_avg_rewards = []
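
As a small illustration of what goes into rolling_avg_rewards, here is a sketch of a rolling mean over the most recent episodes. The function name `rolling_average` and the reward values are made up; the training loop uses np.mean over the last 100 entries, which should give the same result.

```python
# Mean of the most recent `window` values (or of everything, early on).
def rolling_average(values, window=100):
    recent = values[-window:]      # slicing returns the whole list when it is short
    return sum(recent) / len(recent)

rewards = [-21, -19, -20, -17]     # hypothetical episode rewards
print(rolling_average(rewards))             # -19.25
print(rolling_average(rewards, window=2))   # -18.5
```

Because the slice simply returns the whole list while fewer than 100 episodes exist, no separate branch is needed for the early episodes in this version.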


In [6]:
output_box = widgets.Output()
display(output_box)
device = 'cuda'
memory = ReplayBuffer(max_size=1000000, input_shape=obs.shape, device=device)

model = py_model(action_dimension=env.action_space.n, hidden_dimensions=hidden_layers, observation_shape=obs.shape).to(device)

target_model = py_model(action_dimension=env.action_space.n, hidden_dimensions=hidden_layers, observation_shape=obs.shape).to(device)

target_model.load_state_dict(model.state_dict())

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

steps_taken=0

for episode in range (episodes):
    game_complete=False
    reward_from_this_episode=0

    obs, info = env.reset()
    obs = proc_obs(obs)
    
    while game_complete is False and episode_steps<max_episode_steps:
        #Random action for exploring
        if random.random() < epsilon:
            action = env.action_space.sample()
        #exploiting our knowledge
        else:
            q = model.forward(obs.unsqueeze(0).to(device))[0]
            action = torch.argmax(q, dim=-1).item()
        all_actions_taken.append(action)
        reward = 0

        for i in range(frame_action_repeat):
            t = 0
            next_obs, t, game_complete, truncated, info = env.step(action=action)
            next_obs = proc_obs(next_obs)
            reward+= t
            if (game_complete):
                break

        memory.store_transition(obs, action, reward, next_obs, game_complete)

        obs = next_obs
        reward_from_this_episode += reward
        episode_steps +=1
        steps_taken +=1

        if memory.determine_sampleability(batch_size):
            states, actions, rewards, next_states, completed_games = memory.sample_buffer(batch_size)
            
            completed_games = completed_games.unsqueeze(1).float()
            
            q = model(states)
            
            actions = actions.unsqueeze(1).long()

            qsa_batch = q.gather(1, actions)
            
            next_actions = torch.argmax(model(next_states), dim=1, keepdim=True)
            
            next_q = target_model(next_states).gather(1, next_actions)
            
            target_b = rewards.unsqueeze(1) + (1-completed_games) * gamma * next_q

            loss = Green_FN.mse_loss(qsa_batch, target_b.detach())

            model.zero_grad()
            loss.backward()
            optimizer.step()

            if steps_taken % 4 == 0:
                gradual_update(target_model, model)
    """    
    print("-----------")
    print(f"Episode #{episode}")
    print(f"Steps taken this episode {episode_steps}")
    print(f"Episode reward: {reward_from_this_episode}")
    print(f"Total Steps Taken: {steps_taken}")
    print(f"Current Epsilon Value: {epsilon}")
    print("-----------")
    """
    if epsilon > min_epsilon:
        epsilon*= epsilon_decay
    episode_rewards.append(reward_from_this_episode)
    episode_steps_list.append(episode_steps)

    if len(episode_rewards) >= 100:
        rolling_avg_rewards.append(np.mean(episode_rewards[-100:]))
    else:
        rolling_avg_rewards.append(np.mean(episode_rewards))  # Use all available data initially
    latest_rolling_avg = rolling_avg_rewards[-1]  
    with output_box:
        clear_output(wait=True)
        print(f"Completed Episode: {episode}")
        print(f"Steps this Episode: {episode_steps}")
        print(f"Episode Reward: {reward_from_this_episode}")

        fig, axs = plt.subplots(2, 2, figsize=(12, 10))
        axs[0, 0].plot(episode_rewards, label="Reward Per Episode", color="blue")
        axs[0, 0].set_title("Reward Trend Over Time")
        axs[0, 0].set_xlabel("Episodes")
        axs[0, 0].set_ylabel("Reward")
        axs[0, 0].legend()
        axs[0, 1].plot(episode_steps_list, label="Steps Per Episode", color="green")
        axs[0, 1].set_title("Steps Per Episode")
        axs[0, 1].set_xlabel("Episodes")
        axs[0, 1].set_ylabel("Steps")
        axs[0, 1].legend()
        axs[1, 0].plot(rolling_avg_rewards, label="Last 100 Episode Reward Avg", color="red")
        axs[1, 0].set_title("Last 100 Episode Reward Average")
        axs[1, 0].set_xlabel("Episodes")
        axs[1, 0].set_ylabel("Average Reward")
        axs[1, 0].legend()
        axs[1, 1].bar(range(env.action_space.n), np.bincount(all_actions_taken, minlength=env.action_space.n), color="purple", label="Action Count")
        axs[1, 1].set_title("Action Distribution")
        axs[1, 1].set_xlabel("Action")
        axs[1, 1].set_ylabel("Count")
        axs[1, 1].set_xticks(range(env.action_space.n))
        axs[1, 1].legend()
        plt.tight_layout()
        plt.show()
        episode_steps = 0      
        
    if episode % 1000 == 0:
        frames = []
    
        # Reset environment for GIF recording
        obs, info = env.reset()
        obs = proc_obs(obs)
        game_complete = False
        steps_taken_gif = 0  # Track steps during GIF recording
    
        while not game_complete:
            with torch.no_grad():  
                q = model.forward(obs.unsqueeze(0).to(device))[0]
                action = torch.argmax(q, dim=-1).item()
    
            reward = 0
            for _ in range(frame_action_repeat):  # Ensure action repeat is applied
                next_obs, t, game_complete, truncated, info = env.step(action)
                next_obs = proc_obs(next_obs)
                reward += t
                frames.append(env.render())  # Capture frame at each step

            if game_complete:
                break  # Stop recording if the game ends
        
            obs = next_obs  # Move to next observation
            steps_taken_gif += 1  # Increment step count
    
            # No target network update here: this loop only records gameplay and does no training
    
        # Save GIF
        import imageio
        gif_path = f"pong_training_episode_{episode}.gif"
        imageio.mimsave(gif_path, frames, duration=0.05)
        print(f"Saved training gameplay as {gif_path}")
        print(f"Frames in gif: {len(frames)}")

Output()

Saved training gameplay as pong_training_episode_0.gif
Frames in gif: 764
Saved training gameplay as pong_training_episode_1000.gif
Frames in gif: 3880
Saved training gameplay as pong_training_episode_2000.gif
Frames in gif: 3328


KeyboardInterrupt: 

# IV - Results

- Describe the choice of your hyper-parameters for $\gamma$, $\epsilon$, the learning rates $\rho$'s, and the number of hidden units (or other NN hyper-parameters). 
  - Run experiments to find good hyper-parameters
  - Show the experimental outputs to show the process of your selection
- Visualize the results and explain outputs 
  - Run the code and tell me what you observe
  - Add more visualizations to enrich your explanation.
    - Hint: example visualizations can be the reward/return curve, win/lose plot, score plot, etc. Feel free to try new plots if you want.

## The code below was used because the training run I saved for you to view ended before it reached the end condition.
## It still successfully beat Pong many times.

# Hyper Parameters section: Explanation and Code
## Note: I do not have code for hyperparameter testing. I tested some values manually in previous iterations; explanations are below.


Below I explain how I chose and tested each of the following parameters:
batch_size = 64
-For batch size, I tried both 32 and 64 on a few runs and noticed that with 64 the average reward climbed to -15 a few hundred episodes sooner, so I stuck with it. I originally used 32 because I was worried about memory and it was the default ChatGPT suggested, but I changed it once my buffer implementation resolved my memory issues.

epsilon = 1
-Starting epsilon at 1 means the agent explores completely at random at first, which seemed the logical choice.

min_epsilon = .1
-For min epsilon, I chose .1 because I wanted the model to stay adaptable after its early learning stages. I tried 0, but the model stopped improving after about 2200 episodes, so I raised it to .1.

epsilon_decay = .998
-For epsilon decay, I initially went with .99 as shown in the previously mentioned YouTube video, but my model kept getting stuck around -15. I changed it to .998 to slow the epsilon decay, which meant epsilon wouldn't hit min_epsilon for over 1000 episodes. Given significant training time, this led to really strong improvement.
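
To make the effect of the decay rate concrete, here is a minimal sketch that estimates how many episodes it takes epsilon to reach its floor. It assumes epsilon is multiplied by `epsilon_decay` once per episode, which matches the "over 1000 episodes" figure above; the function name `episodes_until_floor` is hypothetical and not part of the training code.

```python
import math


def episodes_until_floor(start: float, floor: float, decay: float) -> int:
    """Smallest number of multiplicative decay steps n with start * decay**n <= floor.

    Assumes epsilon is decayed once per episode (an assumption, not confirmed
    by the training loop shown here).
    """
    return math.ceil(math.log(floor / start) / math.log(decay))


if __name__ == "__main__":
    for decay in (0.99, 0.998):
        n = episodes_until_floor(start=1.0, floor=0.1, decay=decay)
        print(f"epsilon_decay={decay}: epsilon reaches 0.1 after about {n} episodes")
```

Under that assumption, a decay of .99 reaches the floor in roughly 230 episodes, while .998 takes roughly 1150, so the agent keeps exploring about five times longer.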

hidden_layers = 128
-This value was chosen simply because of an early concern about memory and because it is a power of 2. I could probably change it to 256 and improve my training, but training was performing fine as it was. I think it's a good number considering the DeepMind paper used 512 and had much better resources than me (I think I have 8 GB of GPU RAM).


learning_rate = 0.0001


episodes = 35000
This number was set somewhat arbitrarily, based on how long my episodes take. My episodes take 2.46 seconds on average, so I figured I would make my training last almost exactly 1 day. About 5000 episodes, as shown in the current training run, seems sufficient to win games about 50% of the time, so the full training would likely end with a more successful model.

(Note: I changed this to 10000 because I needed to retrain before the deadline.)
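
The time budget behind these episode counts can be checked with a short calculation. This is a rough sketch that assumes every episode takes the 2.46 second average reported above; in practice episodes may get longer as the agent improves, so treat the result as an estimate. The helper name `training_hours` is hypothetical.

```python
SECONDS_PER_EPISODE = 2.46  # average episode time reported in the text


def training_hours(episodes: int, seconds_per_episode: float = SECONDS_PER_EPISODE) -> float:
    """Estimated wall-clock training time in hours, assuming constant episode length."""
    return episodes * seconds_per_episode / 3600


if __name__ == "__main__":
    for episodes in (5_000, 10_000, 35_000):
        print(f"{episodes:>6} episodes -> about {training_hours(episodes):.1f} hours")
```

With these numbers, 35000 episodes come to just under 24 hours and 10000 episodes to a little under 7 hours.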

frame_action_repeat = 4
This parameter forces an action to repeat for 4 frames inside the loop, more accurately representing how Atari games worked with their slower fps. (I also did this to cut training time.) It was a parameter shown in the video, and I found it pretty interesting.

gamma = 0.99
This value was chosen to make the agent weigh long-term rewards heavily. I never really had any issues with this value, and I believe I used it as my default in my last assignment.
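
To illustrate what "heavily consider long-term rewards" means, here is a small sketch of the discounted return. It is a generic illustration rather than code from the agent: the function name `discounted_return` and the 100-step example are assumptions chosen for clarity.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # One point scored 100 agent steps in the future, nothing before it.
    rewards = [0.0] * 100 + [1.0]
    for gamma in (0.9, 0.99):
        print(f"gamma={gamma}: present value of that point = {discounted_return(rewards, gamma):.5f}")
```

With gamma = 0.99 a point scored 100 steps away still counts for about 0.37 today, whereas with gamma = 0.9 it is worth almost nothing, which matters in Pong because a rally can last many steps before a point is scored.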


## Visualization/test

Due to issues I had with creating a model_tester, I instead created a gif every 1000 episodes and put them in model_gifs.
This program creates a gif every 1000 episodes in the same directory it runs in.

## As seen above, many runs successfully beat the game.

## Analysis and Observation of learning results and plots

Completed Episode: 5126
Steps this Episode: 865
Episode Reward: 4.0

The above is basic print text that repeats after every episode. It displays the last episode completed, the number of steps in that episode, and the episode reward. A positive episode reward of 4 means the agent won that game, which shows that I successfully completed the assignment.

The 4 plots are the following:

Reward trend over time:
This plot shows every episode's reward on a basic episode vs. reward plot.
It captures the most important part of this experiment: the first victory occurred around episode 1700.

Steps per episode:
This plot shows the steps per episode, which I thought would be an interesting metric to observe. My prediction was that as the reward per episode approached 0, the steps per episode would increase, after which they would decrease.
As you can see, when my reward gains began to plateau around an average of 0, this metric also began to stabilize.

Last 100 episode reward average:
This metric was selected because I wanted a smoother and clearer view of how the model was progressing.
It takes the last 100 episodes as stored in rolling_avg_rewards and averages them, creating a slow-to-change but
consistently accurate performance metric. 

My model's progress slowed significantly around the 4000 episode mark, at which point it was winning about 50% of its games.
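
The rolling average described above can be sketched with a fixed-length `deque`. This is a minimal illustration under the assumption that the metric is a plain mean of the most recent rewards; the class name `RollingAverage` is hypothetical, and the notebook's own `rolling_avg_rewards` may be implemented differently.

```python
from collections import deque


class RollingAverage:
    """Mean of the most recent `window` values (fewer while warming up)."""

    def __init__(self, window: int = 100):
        # A deque with maxlen drops the oldest value automatically.
        self.values = deque(maxlen=window)

    def add(self, value: float) -> float:
        """Record a new episode reward and return the current rolling mean."""
        self.values.append(value)
        return sum(self.values) / len(self.values)


if __name__ == "__main__":
    tracker = RollingAverage(window=3)  # small window to keep the demo readable
    for episode_reward in (-21.0, -19.0, -17.0, 4.0):
        print(f"reward={episode_reward:6.1f}  rolling mean={tracker.add(episode_reward):7.2f}")
```

Because a single win only shifts a 100-episode mean by a small amount, this metric moves slowly but is much less noisy than the raw per-episode reward.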


Action distribution:
This was the last metric I thought would be interesting, as it gives insight into how action selection occurred. This metric was more useful during the actual run, as you could see the initially even action counts slowly start to favor 3 and 5, two of the paddle-movement actions.
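
As a rough sketch of how such a distribution can be summarized, the snippet below turns a list of chosen actions into per-action fractions. It is a standard-library stand-in for the `np.bincount(..., minlength=env.action_space.n)` call used in the plotting code; the function name `action_fractions` and the sample action list are made up for illustration.

```python
from collections import Counter


def action_fractions(actions, n_actions=6):
    """Fraction of steps on which each action index 0..n_actions-1 was chosen."""
    counts = Counter(actions)
    total = len(actions)
    if total == 0:
        return [0.0] * n_actions
    return [counts.get(action, 0) / total for action in range(n_actions)]


if __name__ == "__main__":
    sample_actions = [3, 3, 5, 0, 3, 5, 5, 3]  # illustrative data, not from a real run
    for action, fraction in enumerate(action_fractions(sample_actions)):
        print(f"action {action}: {fraction:.1%}")
```

Early in training, with epsilon near 1, the fractions should sit near 1/6 each; a shift toward a few actions is a sign that the greedy policy is taking over.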


# V. Conclusions

Discuss the challenges or something that you learned. 
If you have any suggestion about the assignment, you can write about it. 

## My conclusions are below

For this assignment, I have the following conclusions.

Reinforcement learning is hard. Very hard. I had a ton of trouble with environment setup for this assignment and a lot of trouble getting the memory to work consistently. 

Another thing: performance increases, when found, are massive. Moving the buffer out of my GPU memory increased the buffer space I could hold by 1000x. In addition, having an adequate buffer size massively increased performance. I was really happy when, instead of taking a day to train 5000 episodes like I heard some of my classmates talking about, my model took 1 hour per 1450ish episodes.
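
For readers who want to see the idea, here is a minimal sketch of a replay buffer that lives in ordinary CPU memory. It is not the buffer used in this notebook: the class name `ReplayBuffer` and its methods are hypothetical, and it assumes that only the sampled mini-batch would be converted to tensors and moved to the GPU (for example with `.to(device)`) right before a training step.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions kept in host (CPU) memory."""

    def __init__(self, capacity: int):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs, done):
        """Store one transition."""
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size: int):
        """Return a random mini-batch as five parallel tuples (one per field)."""
        batch = random.sample(self.storage, batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3)
    for step in range(5):
        buffer.push(obs=step, action=step % 6, reward=0.0, next_obs=step + 1, done=False)
    obs, actions, rewards, next_obs, dones = buffer.sample(batch_size=2)
    print(len(buffer), obs, actions)
```

Keeping the full buffer on the CPU means GPU memory only has to hold the network and one batch at a time, which is why the buffer can grow so much larger.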

I learned a ton. At the start of this, I had never touched PyTorch and never truly felt like I had successfully written an ML model. This is the first time I have actually had one come together and perform somewhat well.

Having a stable Linux environment helps a massive amount. I wiped my computer drive and installed Linux Mint, which gave me a really easy-to-manage Docker environment. This fixed 90% of my environment issues instantly and allowed me to focus on coding, and I got to experience using my own GPU for machine learning for the first time. Going to sleep at 4am and waking up at 10am to 6500 episodes being done and your model outperforming itself by a ton is a fantastic feeling.

Lastly, I enjoyed the assignment, but I likely started too late. If I had started earlier, I would likely have been able to achieve the extra credit points and could have explored more models, optimizers, and weights, instead of using ones suggested by YouTube videos and research papers.

# Submission

You are required to submit three files. 
1. This notebook with complete writing. **assign2**
2. Stored keras or pytorch model(s) (if you complete the extra credit assignment, you should submit 2 model files) **t**
3. Another notebook that loads the model and tests the trained model on the Pong environment. **model_tester.py**



## Grading

For this assignment, the grading rubric is a bit different. Please check it carefully. 


points | | description
--|--|:--
5 | Overview | States the objective and the approach
20 | Problem | 
 | 10 | Code for setting up the environment
 | 5 | Explanation of the Pong problem
 | 5 | Explanation of the code used to run Pong from Gymnasium
25 | Methods | 
 | 15 | Your DQN agent code
 | 10 | Explanation of your implementation and DQN method
35 | Results | 
 | 5 | Code for hyperparameter search
 | 10 | Experimental outputs that show the choice of parameters, with an explanation of how you chose them
 | 10 | Visualization of learning and learned agent
 | 10 | Observations and analysis of learning results and plots
5 | Conclusions | 
10 | Successful test of the submitted model | 