# Deep-Q-Learning

In this notebook I explore the concepts of Deep-Q-Learning (DQL) by following the DeepMind paper [Volodymyr Mnih et al. (2013)](https://arxiv.org/pdf/1312.5602.pdf).

I'm intrigued by the topic of autonomous driving, and the [MIT lecture videos by Lex Fridman](https://deeplearning.mit.edu/), which introduced me to DQL, motivated me to take a deep dive and get familiar with its concepts.

In [20]:
import gym
import numpy as np

import tensorflow as tf
from tensorflow import keras
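# import the layers from the same Keras that ships with TensorFlow to avoid mixing two Keras installations
from tensorflow.keras.layers import Conv2D, Flatten, Dense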

## Preview of the game

Use **left** and **right arrows** to move the spaceship sideways and the **space bar** to use the main cannon and shoot the aliens!

If you want to move to the left/right and shoot simultaneously then use **s/d**.

As you play, the `PlayPlot` class (marked deprecated) plots the immediate reward for the actions you take. To enable it, uncomment the `callback` argument in the call to `play`.

In [9]:
import pygame
from gym.utils.play import play, PlayPlot

def compute_metrics(obs_t, obs_tp, action, reward, terminated, truncated, info):
    return [reward, np.linalg.norm(action)]

plotter = PlayPlot(
    compute_metrics,
    horizon_timesteps=200,
    plot_names=["Immediate Rew.", "Action Magnitude"]
)

my_env = gym.make("SpaceInvaders-v4", render_mode="rgb_array")
mapping = {(pygame.K_SPACE,): 1, (pygame.K_RIGHT,): 2, (pygame.K_LEFT,): 3, (pygame.K_d,): 4, (pygame.K_s,): 5}
play(my_env, keys_to_action=mapping) #, callback=plotter.callback)

A.L.E: Arcade Learning Environment (version 0.8.0+919230b)
[Powered by Stella]
ERROR: PlayableGame wrapper works only with rgb_array and rgb_array_list render modes, but your environment render_mode = None.


## Taking some random actions

In [None]:
import gym
env = gym.make("SpaceInvaders-v4", render_mode="human")
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = choose_action(env, observation, 0.3)
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()
env.close()

## Starting to define some functions

In [3]:
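def choose_action(env, obs, eps):
    # epsilon-greedy: explore with probability eps, otherwise exploit
    random_number = np.random.rand()

    if random_number < eps:
        # we explore: take a uniformly random action
        action = env.action_space.sample()
        print('we explore!')

    else:
        # we should choose the action with the highest predicted Q-value according to our model;
        # placeholder: no trained model is available yet, so this branch still samples randomly
        action = env.action_space.sample()
        print('we choose the best action!')

    return action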
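
Once a model is available, the exploit branch could pick the action with the largest predicted Q-value. The following is a minimal sketch of that idea, not part of the notebook's pipeline: `epsilon_greedy` is a hypothetical helper, and the Q-values are made-up numbers for the six Space Invaders actions rather than model output.

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Return a random action index with probability eps, else the argmax of q_values."""
    if rng.random() < eps:
        # explore: uniformly random action index
        return int(rng.integers(len(q_values)))
    # exploit: action with the highest estimated Q-value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
# made-up Q-values for 6 actions (assumption, for illustration only)
q_values = np.array([0.1, 0.7, 0.2, -0.3, 0.0, 0.4])
greedy_action = epsilon_greedy(q_values, eps=0.0, rng=rng)
random_action = epsilon_greedy(q_values, eps=1.0, rng=rng)
```

With `eps=0.0` the helper always exploits, with `eps=1.0` it always explores; values in between trade the two off.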
    

For the first attempt we'll use the neural network architecture described on page 6 of [Volodymyr Mnih et al. (2013)](https://arxiv.org/pdf/1312.5602.pdf).
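
To get a feel for the layer sizes, the spatial output of each convolution can be computed by hand. The sketch below assumes the usual Keras conventions for `"same"` and `"valid"` padding and an 84x84 input; `conv_output_size` and `flattened_features` are hypothetical helper names, not Keras functions.

```python
import math

def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension (Keras conventions)."""
    if padding == "same":
        return math.ceil(size / stride)
    # "valid": no padding, the kernel must fit entirely inside the input
    return (size - kernel) // stride + 1

def flattened_features(padding):
    side = conv_output_size(84, 8, 4, padding)    # first layer: 8x8 kernel, stride 4
    side = conv_output_size(side, 4, 2, padding)  # second layer: 4x4 kernel, stride 2
    return side * side * 32                       # 32 filters in the second layer

features_same = flattened_features("same")    # 21 -> 11, so 11 * 11 * 32
features_valid = flattened_features("valid")  # 20 -> 9, so 9 * 9 * 32
```

The padding choice therefore changes how many features reach the 256-unit dense layer, which is worth keeping in mind when comparing parameter counts.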

In [44]:
class neural_network:

    def __init__(self, obs_shape, action_shape, learning_rate):
        # define a Sequential CNN using Keras predefined layers
        # Here we'll use the same architecture as in Volodymyr Mnih et al. (2013)
        self.model = keras.Sequential()
        self.model.add(Conv2D(filters=16, kernel_size=(8, 8), strides=(4, 4), padding="same", activation="relu", input_shape=(84, 84, 4)))
        # only the first layer needs input_shape
        self.model.add(Conv2D(filters=32, kernel_size=(4, 4), strides=(2, 2), padding="same", activation="relu"))
        self.model.add(Flatten())
        self.model.add(Dense(256, activation="relu"))
        # linear output layer: one Q-value per action, and Q-values may be negative
        self.model.add(Dense(action_shape, activation="linear"))

        # estimating Q-values is a regression problem, so no accuracy metric is tracked
        self.model.compile(loss=keras.losses.Huber(), optimizer=keras.optimizers.Adam(learning_rate=learning_rate))
        
    def preprocess_input(self, obs_shape, frames):
        # define a preprocess function for the frames received from the Atari games
        # we'll again follow the preprocessing proposed by Volodymyr Mnih et al. (2013)

        # we take the last 4 frames from the sequence
        # assumption: frames is a list or array of RGB frames, each of shape (210, 160, 3)
        frames = np.asarray(frames[-4:])
        # first we convert the RGB representation to gray-scale
        preprocessed_frames = tf.image.rgb_to_grayscale(frames)
        # then we size the frames down from 210x160 to 110x84
        preprocessed_frames = tf.image.resize(preprocessed_frames, [110, 84])
        # finally we crop out the 84x84 region that captures roughly the playing area
        # TODO: not implemented yet; the paper crops because its GPU 2D convolutions expect square inputs,
        # and the model above also expects 84x84 inputs

        # stack the four gray-scale frames along the channel axis to get 110x84x4
        sequence = np.concatenate(preprocessed_frames.numpy(), axis=2)
        return sequence
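
The missing crop step can be sketched independently of TensorFlow. The version below is a NumPy-only stand-in under stated assumptions, not the paper's exact procedure: it uses nearest-neighbour downsampling instead of `tf.image.resize`, and the crop offset `crop_top=18` is a guess, since the paper only says the crop roughly captures the playing area. `preprocess_frames` is a hypothetical name.

```python
import numpy as np

def preprocess_frames(frames, crop_top=18):
    """Turn the last four RGB frames (210x160x3) into one 84x84x4 array."""
    frames = np.asarray(frames[-4:], dtype=np.float32)
    # luminance-weighted gray-scale conversion -> (4, 210, 160)
    gray = frames @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # nearest-neighbour downsampling to 110x84 (simple stand-in for tf.image.resize)
    rows = np.arange(110) * 210 // 110
    cols = np.arange(84) * 160 // 84
    small = gray[:, rows][:, :, cols]
    # crop 84 rows starting at crop_top (assumed offset) -> (4, 84, 84)
    cropped = small[:, crop_top:crop_top + 84, :]
    # move the frame axis last and scale to [0, 1] -> (84, 84, 4)
    return np.transpose(cropped, (1, 2, 0)) / 255.0

# six random dummy frames stand in for real observations
dummy_frames = [np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8) for _ in range(6)]
state = preprocess_frames(dummy_frames)
```

The resulting `state` has the 84x84x4 shape that the first convolutional layer expects; a batch dimension would still have to be added before calling the model.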
        
        


In [33]:
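import random
from collections import deque

class replay_memory:
    def __init__(self, capacity):
        self.capacity = capacity
        # a deque with maxlen drops the oldest transition once capacity is reached
        self.memories = deque(maxlen=capacity)

    def store_transition(self, transition):
        # store a transition, e.g. (state, action, reward, next_state, done), in replay_memory
        self.memories.append(transition)

    def sample_minibatch(self, batch_size=32):
        # sample a random minibatch of transitions from replay_memory
        # the default of 32 follows the minibatch size used in the paper
        batch_size = min(batch_size, len(self.memories))
        return random.sample(list(self.memories), batch_size)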

In [29]:
N = 10  # capacity of the replay memory
learning_rate = 0.001

my_env = gym.make("SpaceInvaders-v4", render_mode="rgb_array")
obs_shape = my_env.observation_space.shape  # shape of one RGB frame
action_shape = my_env.action_space.n  # number of discrete actions

num_of_episodes = 10

In [38]:
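def train_DQL():
    ## initialize replay memory D
    memories = replay_memory(N)

    ## initialize action-value function Q with random weights
    ## The action-value function in DQL is a neural network, so we'll instantiate a model
    main_model = neural_network(obs_shape, action_shape, learning_rate)

    ## I may need a target model to keep the "older" weights of the model;
    ## the 2013 paper does not use one (it appears in the 2015 Nature follow-up)
    ## target_model = neural_network(obs_shape, action_shape, learning_rate)

    ## outer loop over the number of episodes we'll train the model for
    for episode in range(num_of_episodes):
        print(f'starting episode {episode}')
        observation, info = my_env.reset()

        ## initialise sequence s1 = {x1} and preprocess the sequence
        ## TODO: frame stacking and preprocessing are not wired in yet; raw observations are stored for now

        done = False
        while not done and info["lives"] > 0:
            ## epsilon-greedy action selection (eps = 0.3 as in the random-action cell)
            action = choose_action(my_env, observation, 0.3)
            next_observation, reward, terminated, truncated, info = my_env.step(action)
            done = terminated or truncated

            ## store the transition in the replay memory
            memories.store_transition((observation, action, reward, next_observation, done))
            observation = next_observation

            ## TODO: sample a minibatch from the replay memory and perform a gradient step on main_model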
            

In [47]:
observation, info = my_env.reset()
observation, info["lives"]

(array([[[ 0,  0,  0],
         [ 0,  0,  0],
         [ 0,  0,  0],
         ...,
         [ 0,  0,  0],
         [ 0,  0,  0],
         [ 0,  0,  0]],
 
        [[ 0,  0,  0],
         [ 0,  0,  0],
         [ 0,  0,  0],
         ...,
         [ 0,  0,  0],
         [ 0,  0,  0],
         [ 0,  0,  0]],
 
        [[ 0,  0,  0],
         [ 0,  0,  0],
         [ 0,  0,  0],
         ...,
         [ 0,  0,  0],
         [ 0,  0,  0],
         [ 0,  0,  0]],
 
        ...,
 
        [[80, 89, 22],
         [80, 89, 22],
         [80, 89, 22],
         ...,
         [80, 89, 22],
         [80, 89, 22],
         [80, 89, 22]],
 
        [[80, 89, 22],
         [80, 89, 22],
         [80, 89, 22],
         ...,
         [80, 89, 22],
         [80, 89, 22],
         [80, 89, 22]],
 
        [[80, 89, 22],
         [80, 89, 22],
         [80, 89, 22],
         ...,
         [80, 89, 22],
         [80, 89, 22],
         [80, 89, 22]]], dtype=uint8),
 3)

## Deep-Q-Learning Algorithm

<img src="./DQL-algorithm.png" alt="dql" width="850"/>
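
The central step of the algorithm is the regression target for each sampled transition: the reward alone for terminal transitions, otherwise the reward plus the discounted maximum Q-value of the next state. The sketch below illustrates that rule with NumPy; `q_learning_targets` is a hypothetical helper, the Q-values are made-up numbers rather than model predictions, and the default `gamma=0.99` is an assumption, not a value taken from this notebook.

```python
import numpy as np

def q_learning_targets(rewards, next_q_values, dones, gamma=0.99):
    """y_j = r_j for terminal transitions, else r_j + gamma * max_a' Q(phi_{j+1}, a')."""
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=bool)
    # best achievable Q-value in each next state
    max_next_q = np.max(next_q_values, axis=1)
    # the bootstrap term is switched off where the episode ended
    return rewards + gamma * max_next_q * (~dones)

# made-up minibatch of three transitions with two actions each (illustration only)
rewards = [1.0, 0.0, 5.0]
next_q = np.array([[0.5, 2.0], [1.0, -1.0], [3.0, 4.0]], dtype=np.float32)
dones = [False, False, True]
targets = q_learning_targets(rewards, next_q, dones, gamma=0.5)
```

In a full training step, these targets would be compared against the model's prediction for the action actually taken, and the loss would be minimised with a gradient step.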