# GridWorld env 

This is a simple 2D env built by Arthur Juliani that poses a full MDP.
It gives out images as states, each of which contains an actor, some villains & some heroes.
The actor has to reach the heroes for a positive reward and avoid the villains to escape the negative reward.
I'm not sure whether we pay a price for time spent travelling, or whether we can reach multiple heroes to maximise reward.

Let's explore and find out.

In [14]:
import random
from pprint import pprint

In [3]:
from gridworld import gameEnv

# Get env object
env = gameEnv(partial=False,size=5)

In [4]:
# Number of actions
env.actions  

4

In [5]:
# All objects in current game
env.objects  

[<gridworld.gameOb instance at 0x10f91bf80>,
 <gridworld.gameOb instance at 0x10f91f3b0>,
 <gridworld.gameOb instance at 0x10f91f2d8>,
 <gridworld.gameOb instance at 0x10f91f4d0>,
 <gridworld.gameOb instance at 0x10f91f0e0>,
 <gridworld.gameOb instance at 0x10f91f320>,
 <gridworld.gameOb instance at 0x10f91f290>]

In [11]:
# Object attributes
object_ = env.objects[random.randint(0, len(env.objects) - 1)]
object_.name, object_.size, (object_.x, object_.y), (object_.channel, object_.intensity)

('goal', 1, (3, 1), (1, 1))

In [20]:
# Objects stats
from collections import Counter
objects = [object_.name for object_ in env.objects]
Counter(objects)

Counter({'fire': 2, 'goal': 4, 'hero': 1})

In [6]:
# Init env
state = env.reset()
state.shape

(84, 84, 3)

In [23]:
def get_action_random(state):
    '''
    Returns randomly one action from [0, 1, 2, 3]
    '''
    return random.choice([0, 1, 2, 3])

get_action_random(state=None)

3

In [25]:
# Play a game to get an idea on the dynamics
state = env.reset()
done = False
counter = 0

while not done:
    action = get_action_random(state)
    state, reward, done = env.step(action)
    

    print("Reward for choosing action: {} is {}".format(
            action, reward
    ))

    counter += 1
    
    if counter > 100:
        break

print("Game ended after {} units of time".format(
        counter
))

Reward for choosing action: 2 is 1.0
Reward for choosing action: 0 is 0.0
Reward for choosing action: 0 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 2 is 0.0
Reward for choosing action: 3 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 3 is 1.0
Reward for choosing action: 2 is 0.0
Reward for choosing action: 0 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 3 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 3 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 2 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 1 is 0.0
Reward for choosing action: 2 is 0.0
Reward for choosing action: 2 is 0.0
Reward for choosing action: 2 is 0.0
Reward for choosing action: 2 is 0.0
R

## Observations
* It looks like, at every time step, the `hero` has to move in one of the four known directions.
* Passing through `fire` gives -1, passing through a `goal` gives +1, and anything else gives zero.
* There is no `end` condition for this env, so the goal must be to maximise total reward.
* The env is also dynamically populated: the objects are placed randomly for each episode, so rote memorisation won't help the agent.
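
The observations above can be checked by tallying the rewards seen under a random policy. The sketch below assumes only the `reset()` / `step(action)` interface used earlier, where `step` returns `(state, reward, done)`. `ToyEnv` and `tally_rewards` are hypothetical names: `ToyEnv` is a stand-in with random rewards (not the real `gameEnv`) so that the snippet runs on its own.

```python
import random
from collections import Counter


class ToyEnv(object):
    """Hypothetical stand-in with the same reset()/step() shape as gameEnv.

    Rewards are drawn at random here; the real env derives them from
    collisions with `goal` and `fire` objects.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return None  # the real env returns an (84, 84, 3) image

    def step(self, action):
        reward = self.rng.choice([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
        return None, reward, False  # `done` is never True, as observed above


def tally_rewards(env, max_steps=100, seed=0):
    """Play one capped episode with random actions and count each reward value."""
    rng = random.Random(seed)
    env.reset()
    counts, total = Counter(), 0.0
    for _ in range(max_steps):
        _, reward, done = env.step(rng.choice([0, 1, 2, 3]))
        counts[reward] += 1
        total += reward
        if done:
            break
    return counts, total


counts, total = tally_rewards(ToyEnv(), max_steps=100)
print(counts, total)
```

Passing `gameEnv(partial=False, size=5)` instead of `ToyEnv()` should give the real reward distribution, assuming `step` keeps returning three values.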

<br>

# Visual Deep Q Agent

## Architecture (Vanilla)



In [27]:
from keras.models import Model
from keras.layers import (
    Conv2D,
    Dense,
    MaxPooling2D,
    Flatten,
    Input
)

In [28]:
import random
import numpy as np
from gridworld import gameEnv

In [29]:
# Architecture parameters
kernel_size = (9, 9)
pool_stride = (2, 2)
conv_activation = "tanh"
dense_activation = "relu"
output_activation = "linear"
lr = 0.1  # NOTE: never passed to the optimiser below, so Adam's default learning rate applies
optimiser = "adam"  # (default)

In [38]:
class GridEnvDQNAgent(object):
    '''
    An agent which exposes `learn` and `demo` methods for learning the GridWorld env
    and running a demo of what it has learnt.
    '''
    
    def __init__(self, exploration_prob=1.0, exploration_prob_min=0.1, max_episodes=10000, max_episode_length=500, batch_size=20):
        '''
        Tunable parameters (partial list)
        Apart from these, there are optimiser types, learning rates, decay rates;
        model architecture parameters; reward digestion mechanism as variables.
        '''
        self.exploration_prob = exploration_prob
        self.exploration_prob_min = exploration_prob_min
        self.max_episodes = max_episodes
        self.max_episode_length = max_episode_length
        self.batch_size = batch_size
        self.agent = self.build_DQN(input_dims=(84, 84, 3), number_of_actions=4)
        self.env = gameEnv(partial=False,size=5)
        self.pick_policy = "!random"  # "random" samples experiences at random; any other value trains on the most recent few experiences.
        self.experiences = []
        self.rewards = []
        
    def build_DQN(self, input_dims, number_of_actions):
        '''
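        A deep network which takes a state as input & outputs action values (Q-values).
        Essentially this learns the Q table for solving GridWorld, hence the name Deep-Q-Network.
        '''
        self.input_dims = input_dims
        # Batch shape for `predict`; set here so it exists before `learn` is called
        self.reshape_dims = (-1,) + tuple(input_dims)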
        input_ = Input(shape=input_dims)
        
        c1 = Conv2D(activation=conv_activation, filters=32, kernel_size=kernel_size, padding="SAME")(input_)
        c1 = MaxPooling2D(strides=pool_stride, pool_size=pool_stride)(c1)
        
        c2 = Conv2D(activation=conv_activation, filters=128, kernel_size=kernel_size, padding="SAME")(c1)
        c2 = MaxPooling2D(strides=pool_stride, pool_size=pool_stride)(c2)
        
        c3 = Conv2D(activation=conv_activation, filters=256, kernel_size=kernel_size, padding="SAME")(c2)
        c3 = MaxPooling2D(strides=pool_stride, pool_size=pool_stride)(c3)
        
        flattened = Flatten()(c3)
        
        d1 = Dense(1024, activation=dense_activation)(flattened)
        d2 = Dense(512, activation=dense_activation)(d1)
        output_ = Dense(number_of_actions, activation=output_activation)(d2)
        
        agent_model = Model(inputs=[input_], outputs=[output_])
        
        agent_model.compile(
            optimizer=optimiser,
            loss="mse",
            metrics=["accuracy"]
        )
        
        agent_model.summary()  # summary() prints the table itself and returns None
        
        return agent_model
    
    
    def get_greedy_action(self, state):
        '''
        Given a state, picks an action epsilon-greedily: a random action with a
        probability of `exploration_prob`, otherwise the action the DQN (agent) values most.
        '''
        if random.random() <= self.exploration_prob:
            action = random.choice([0, 1, 2, 3])
        else:
            actions_pd = self.agent.predict(state.reshape(self.reshape_dims))
            action = np.argmax(actions_pd)

        return action
        
    def get_training_data(self, pick_policy):
        '''
        When invoked, returns at most self.batch_size number of experiences 
        from self.experiences
        '''       
        mini_batch_size = min(self.batch_size, len(self.experiences))
        if pick_policy == "random":
            # Pick randomly 
            mini_batch = random.sample(
                population=self.experiences,
                k=mini_batch_size
            )
        else:
            # Pick last few
            mini_batch = self.experiences[-mini_batch_size:]
        
        trainX, trainY = [], []
        for x, y in mini_batch:
            trainX.append(x)
            trainY.append(y)

        return np.array(trainX), np.array(trainY)
    
    def decay_exploration_prob(self):
        '''
        Slowly decays the exploration probability with runtime
        '''
        self.exploration_prob -= 1. / self.max_episodes
        self.exploration_prob = max(self.exploration_prob, self.exploration_prob_min)
    
    def learn(self):
        '''
        Initialises env, plays & updates DQN to the optimal Q
        '''   
        # Play a maximum of `max_episodes` number of games
        for game in range(1, self.max_episodes + 1):
            
            self.experiences = []
            current_state = self.env.reset()
            accum_reward = 0
            
            # Stock experiences into buffer of length max number of time units a game should sustain
            for time_unit in range(self.max_episode_length):
                
                # Take an action (exploratorily) and see how it rewards now 
                action = self.get_greedy_action(state=current_state)
                next_state, reward, done = self.env.step(action)
                
                # Now update the action_pd responsible for above action by including 
                # current reward and value we'd achieve by following optimal policy 
                # from now on, given we took above action
                self.reshape_dims = tuple([-1]) + self.input_dims
                target_action_pd = self.agent.predict(current_state.reshape(self.reshape_dims))[0]
                target_action_pd[action] = reward + 0.99 * np.max(self.agent.predict(next_state.reshape(self.reshape_dims))[0])
                if done is True:  # In this env, done is never true.
                    target_action_pd[action] = -1
                
                # Stock the experiences
                self.experiences.append([current_state.tolist(), target_action_pd.tolist()])
                current_state = next_state
                accum_reward += reward
                
                # Check if it's time to update policy (train DQN) 
                if time_unit % self.batch_size == 0 or done is True:
                    trainX, trainY = self.get_training_data(pick_policy=self.pick_policy)
                    self.agent.train_on_batch(trainX, trainY)
                
                # Exit if game's over
                if done is True:  # In this env, done is never true.
                    break
            
            self.rewards.append(accum_reward)
            self.decay_exploration_prob()
            
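            stat_update_freq = 2
            if game % stat_update_freq == 0:
                # NOTE: the mean covers the last 100 games (or all games played so far),
                # even though the message below reports `stat_update_freq`
                avg_reward = np.array(self.rewards[-100:]).mean()
                print("Avg. reward over last {0:d} is {1:3.2f}. Last played game#{2:d}".format(
                    stat_update_freq, avg_reward, game
                ))
                # NOTE: this threshold and message are carried over from a CartPole-v0 agent
                if avg_reward > 195.0:
                    print("Done solving... ")
                    print("""(CartPole-v0 defines "solving" as getting average reward 
                    of 195.0 over 100 consecutive trials.)""")
                    break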
            
            
    def demo(self):
        '''
        Run this to render the performance of trained model.
        Be sure to first train the model.
        '''
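        total_reward = 0
        state = self.env.reset()
        # Add a batch dimension matching the network input, e.g. (-1, 84, 84, 3)
        reshape_dims = (-1,) + tuple(self.input_dims)

        # `done` is never True in this env, so cap the demo at max_episode_length steps
        for _ in range(self.max_episode_length):
            action = np.argmax(self.agent.predict(state.reshape(reshape_dims))[0])
            state, reward, done = self.env.step(action)
            total_reward += reward
            self.env.render()
            if done:
                break

        return total_reward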
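
The target built inside `learn` is the usual one-step Q-learning update: the predicted values for the current state are kept, and only the entry for the action taken is replaced by `reward + 0.99 * max(Q(next_state))`. Below is a minimal sketch with plain lists standing in for the network's predictions. The function name `q_target` and its `terminal_value` parameter are illustrative and not part of the class above; the `-1` for terminal steps mirrors what `learn` does.

```python
def q_target(q_current, q_next, action, reward, gamma=0.99,
             done=False, terminal_value=-1.0):
    """Return a copy of q_current with the taken action's value updated."""
    target = list(q_current)  # copy, so the prediction itself is left untouched
    if done:
        target[action] = terminal_value
    else:
        # Immediate reward plus discounted value of the best next action
        target[action] = reward + gamma * max(q_next)
    return target


q_now = [0.2, 0.1, 0.0, 0.4]    # stand-in for agent.predict(current_state)[0]
q_after = [0.5, 0.3, 0.9, 0.1]  # stand-in for agent.predict(next_state)[0]
target = q_target(q_now, q_after, action=2, reward=1.0)
print(target)
```

Only index 2 changes; the other three entries equal the prediction, so they contribute zero error under the `mse` loss.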

In [39]:
# RL Hyperparameters
exploration_prob = 1.0
exploration_prob_min = 0.1
max_episodes = 10000
max_episode_length = 50
batch_size = 32
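
With these values, `decay_exploration_prob` lowers the exploration probability by `1 / max_episodes` after every game and clamps it at `exploration_prob_min`. A small standalone sketch of that schedule (the helper name `exploration_schedule` is illustrative):

```python
def exploration_schedule(start=1.0, minimum=0.1, max_episodes=10000):
    """Yield the exploration probability used in each successive game."""
    prob = start
    for _ in range(max_episodes):
        yield prob
        # Linear decay by 1 / max_episodes, never dropping below the minimum
        prob = max(prob - 1.0 / max_episodes, minimum)


probs = list(exploration_schedule())
print(probs[0], probs[5000], probs[-1])
```

The floor is reached after roughly 9,000 games, since (1.0 - 0.1) / (1 / 10000) = 9000, so about the last 1,000 games run at the minimum exploration probability.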

In [40]:
# Instantiate GridEnvDQNAgent object.
# This object should be trained afresh every time the notebook is restarted
# Consider saving weights of trained agent to make demo work out of the box
DQNAgent = GridEnvDQNAgent(
    max_episodes=max_episodes,
    exploration_prob_min=exploration_prob_min,
    batch_size=batch_size,
    max_episode_length=max_episode_length
)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 84, 84, 3)         0         
_________________________________________________________________
conv2d_22 (Conv2D)           (None, 84, 84, 32)        7808      
_________________________________________________________________
max_pooling2d_22 (MaxPooling (None, 42, 42, 32)        0         
_________________________________________________________________
conv2d_23 (Conv2D)           (None, 42, 42, 128)       331904    
_________________________________________________________________
max_pooling2d_23 (MaxPooling (None, 21, 21, 128)       0         
_________________________________________________________________
conv2d_24 (Conv2D)           (None, 21, 21, 256)       2654464   
_________________________________________________________________
max_pooling2d_24 (MaxPooling (None, 10, 10, 256)       0         
__________
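
The shapes and parameter counts in the summary can be reproduced by hand. This is a minimal sketch, assuming `padding="SAME"` convolutions (spatial size unchanged) and 2x2 max pooling with stride 2 and no padding (spatial size floor-divided by 2). The helper name `conv_params` is illustrative.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution."""
    return kernel[0] * kernel[1] * in_channels * filters + filters


size, channels = 84, 3
for filters in (32, 128, 256):
    params = conv_params((9, 9), channels, filters)
    size, channels = size // 2, filters  # conv keeps the size, pooling halves it
    print("conv -> pool: ({0}, {0}, {1}), conv params: {2}".format(
        size, channels, params))

flattened = size * size * channels
print("flattened features: {}".format(flattened))
```

The three counts match the `Param #` column (7808, 331904, 2654464). The flattened size is derived from the last pooled shape, since the corresponding row of the summary is cut off.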

In [None]:
# Explore, sync reward and learn
DQNAgent.learn()

Avg. reward over last 2 is 3.00. Last played game#2
Avg. reward over last 2 is 2.50. Last played game#4
Avg. reward over last 2 is 2.83. Last played game#6
Avg. reward over last 2 is 2.88. Last played game#8
Avg. reward over last 2 is 2.10. Last played game#10
Avg. reward over last 2 is 2.58. Last played game#12
Avg. reward over last 2 is 2.07. Last played game#14
Avg. reward over last 2 is 2.00. Last played game#16
Avg. reward over last 2 is 2.17. Last played game#18
Avg. reward over last 2 is 1.75. Last played game#20
Avg. reward over last 2 is 1.73. Last played game#22
Avg. reward over last 2 is 1.79. Last played game#24
Avg. reward over last 2 is 1.73. Last played game#26
Avg. reward over last 2 is 1.82. Last played game#28
Avg. reward over last 2 is 1.73. Last played game#30
Avg. reward over last 2 is 1.94. Last played game#32
Avg. reward over last 2 is 1.79. Last played game#34
Avg. reward over last 2 is 1.86. Last played game#36
Avg. reward over last 2 is 1.82. Last played game#