## Pacman environment and how the game works

In Ms. Pac-Man, the player earns points by eating pellets while avoiding monsters; contact with a monster costs Ms. Pac-Man a life [src: https://en.wikipedia.org/wiki/Ms._Pac-Man ].

The game has four mazes, each with its own color scheme, and the maze changes after each of the game's intermissions. The pink maze appears in levels 1 and 2, the light blue maze in levels 3, 4, and 5, the brown maze in levels 6 through 9, and the dark blue maze in levels 10 through 14. After level 14, the maze configurations alternate every 4th level.
Three of the four mazes (the first, second, and fourth) have two sets of warp tunnels, as opposed to only one in the original Pac-Man maze.

The walls have a solid color rather than an outline, which makes it easier for a novice player to see where the paths around the mazes are.

#### Possible movements = 9: left, right, up, down, centre, upper-left, upper-right, lower-left, lower-right


## Using DQN for training Pacman in OpenAI Gym

There are 250 pellets that Ms. Pac-Man can eat. Neural networks are very good at learning a large number of features from highly structured data such as images. Hence, we can use Q-learning with the Q-function represented by a neural network that takes a state as input and returns one Q-value per action as output.

To make the agent learn to play the game on its own, we feed the game frames to a deep learning model and use a DQN to estimate the best Q-values for the agent in its environment. Each experience the DQN learns from consists of 4 quantities: state (s), action (a), reward (r), and next state (s'). The Q-values are then updated using the following idea:

* Use forward propagation to predict the Q-values for the current state s and all actions a.

* Use the target network to compute the maximum Q-value $\max_{a'} Q_{target}(s', a')$ over all actions a' in the next state s'.

* Compute the target Q-value for the action taken using the following formula:

$$ y(s,a) = r + \gamma \max_{a'} Q_{target}(s', a') $$
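As a rough illustration of the target formula above, the following sketch computes $y(s,a)$ for a single transition from plain Python lists. The function name `td_target` and the sample Q-values are hypothetical and chosen only for illustration; the discount of 0.95 matches the `gamma` used by the agent later in this notebook.

```python
def td_target(reward, next_q_values, gamma=0.95, done=False):
    """Return r + gamma * max_a' Q_target(s', a'), or just r for a terminal step."""
    if done:
        # No next state to bootstrap from when the episode has ended
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical target-network outputs for the 9 actions in state s'
next_q = [0.5, 1.2, -0.3, 2.0, 0.0, 0.7, 1.1, -1.0, 0.4]

y = td_target(reward=10.0, next_q_values=next_q)
y_terminal = td_target(reward=10.0, next_q_values=next_q, done=True)
print(y, y_terminal)
```

Only the largest next-state Q-value (2.0 here) contributes to the target; the other eight actions are ignored.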


### Preprocessing the Pacman environment for feeding to the Neural Network

Each state of the Pacman environment is a frame of the game screen, stored as a 3D numpy array holding the RGB image of the screen.

Training the NN directly on the full-size RGB array is difficult and computationally expensive. Hence, we do some basic preprocessing as follows:

* Crop and scale the frame down to an 88 x 80 grid.
* Convert the RGB image to greyscale.
* Increase the contrast so that Ms. Pac-Man stands out more sharply.
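To see where the 88 x 80 size comes from, the sketch below reproduces the slicing arithmetic used by `process_frame` later in this notebook (`frame[1:176:2, ::2]`). It assumes the raw Atari frame is 160 pixels wide, which is not stated in the text above; the helper name `sliced_length` is hypothetical.

```python
def sliced_length(start, stop, step):
    """Number of indices selected by the slice start:stop:step."""
    return len(range(start, stop, step))

RAW_WIDTH = 160  # assumed raw frame width (not given in the text)

height = sliced_length(1, 176, 2)        # rows 1, 3, ..., 175
width = sliced_length(0, RAW_WIDTH, 2)   # every second column
print(height, width)

# Greyscale value of the colour (210, 164, 74) that process_frame zeroes out:
# the greyscale conversion is a plain mean over the three RGB channels
pacman_grey = sum([210, 164, 74]) / 3
print(round(pacman_grey, 2))
```

Taking every second row and column halves each dimension, and stopping at row 176 crops off the bottom of the screen.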

We build our deep learning model as a convolutional neural network with the following specifications:

* 3 convolutional layers
    * 1st layer - 32 filters, 8x8 convolution mask, strides = 4, output downsized to shape = 22x20x32, relu
    * 2nd layer - 64 filters, 4x4 convolution mask, strides = 2, output downsized to shape = 11x10x64, relu
    * 3rd layer - 64 filters, 3x3 convolution mask, strides = 1, output shape unchanged at 11x10x64, relu
    
* 2 fully connected layers
    * 1st layer = 512 nodes, relu, input = the flattened 11x10x64 feature map (7040 values)
    * 2nd and final layer = 9 nodes, one for each of the 9 possible actions
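The layer shapes listed above can be checked with a little arithmetic. The sketch below assumes `padding='same'` (as in the model code later in this notebook), for which the output size along each axis is ceil(input / stride); the helper names `same_conv_shape` and `conv_params` are hypothetical.

```python
import math

def same_conv_shape(height, width, filters, stride):
    """Output shape of a 'same'-padded convolution: ceil(size / stride) per axis."""
    return (math.ceil(height / stride), math.ceil(width / stride), filters)

def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels * filters) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

shape = (88, 80, 1)                             # preprocessed greyscale frame
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]   # (filters, kernel, stride)

shapes, params = [], []
for filters, kernel, stride in layers:
    params.append(conv_params(kernel, shape[2], filters))
    shape = same_conv_shape(shape[0], shape[1], filters, stride)
    shapes.append(shape)

flattened = shape[0] * shape[1] * shape[2]      # input size of the first dense layer
print(shapes, params, flattened)
```

The resulting shapes and parameter counts agree with the `model.summary()` output printed further down in this notebook.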
    
Apart from the agent's main DQN, we also maintain a second deep learning model, the target network. It has the same architecture and estimates the next state's Q-values for each possible action; these are used to compute the target Q-values for training the agent's DQN. Its weights are copied from the agent's DQN at regular intervals.
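A minimal sketch of this two-network arrangement, using plain Python lists in place of real network weights. The class `TinyNet` and the update interval of 3 steps are hypothetical and exist only to show the pattern: the online weights change on every step, while the target copy changes only when it is synchronised.

```python
class TinyNet:
    """Stand-in for a neural network: its 'weights' are just a list of floats."""
    def __init__(self, weights):
        self.weights = list(weights)

    def get_weights(self):
        return list(self.weights)

    def set_weights(self, weights):
        self.weights = list(weights)

online = TinyNet([0.0, 0.0])
target = TinyNet(online.get_weights())   # target starts as a copy of the online net
update_rate = 3                          # hypothetical: sync the target every 3 steps

history = []
for step in range(1, 8):
    # Pretend one training step nudges every online weight by 0.1
    online.set_weights([w + 0.1 for w in online.get_weights()])
    if step % update_rate == 0:
        target.set_weights(online.get_weights())
    history.append(round(target.weights[0], 1))

print(history)
```

Between synchronisations the target values stay fixed, which keeps the training targets from shifting on every gradient step.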
    


In [1]:
import gym
import numpy as np
import random
import matplotlib.pyplot as plt  # needed for the reward and epsilon plots below
from collections import deque

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Conv2D
from keras.optimizers import Adam

Using TensorFlow backend.


In [3]:
class Pacman_Agent:
    def __init__(self, state_size, action_size):
        # Define environment parameters
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=5000)
        
        # Define Hyperparameters to learn through deep learning
        self.gamma = 0.95            # discount rate
        self.epsilon = 1.0          # exploration rate to start
        self.epsilon_min = 0.1      # minimum exploration rate (epsilon-greedy)
        self.epsilon_decay = 0.995  # decay rate for epsilon
        self.learning_rate = 0.001
        self.update_rate = 1000     # steps needed until target network gets updated
        
        # Construct DQN models
        self.model = self._build_model()
        self.target_model = self._build_model()
        self.target_model.set_weights(self.model.get_weights())
        self.model.summary()

    # Build the CNN for DQN on the pacman environment 
    def _build_model(self):
        model = Sequential()
        
        # Convolutional layers: extract features from the game frame
        # 1st layer - 32 filters, kernel size = 8x8, strides = 4 pixels,
        # input_shape = shape of a preprocessed state
        model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))
        model.add(Activation('relu'))
        
        # 2nd layer - 64 filters, kernel size = 4x4, strides reduced to 2 pixels
        model.add(Conv2D(64, (4, 4), strides=2, padding='same'))
        model.add(Activation('relu'))
        
        # 3rd layer - 64 filters, kernel size = 3x3, strides reduced to 1 pixel
        model.add(Conv2D(64, (3, 3), strides=1, padding='same'))
        model.add(Activation('relu'))
        model.add(Flatten())

        # Fully connected layers: map the extracted features to Q-values
        # 1st layer - 512 units, 2nd layer - one output (Q-value) per possible action
        model.add(Dense(512, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        
        # Compile the model with MSE loss and the Adam optimizer
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    #Function to save past experiences as replays
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    # Function to select actions based on epsilon-greedy method
    def act(self, state):
        # Explore randomly
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        
        # Otherwise exploit: predict the Q-value of every action and pick the best
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])  # Greedy action under the current Q-estimates

    # Sample a random minibatch of stored experiences and train the agent on it
    def replay(self, batch_size):
        # Draw a small random batch from the replay memory
        minibatch = random.sample(self.memory, batch_size)
        
        # Compute the target Q-value: bootstrap from the target network
        # unless the episode ended on this transition
        for state, action, reward, next_state, done in minibatch:
            if not done:
                target = (reward + self.gamma * np.amax(self.target_model.predict(next_state)))
            else:
                target = reward
                
            # Making new targets
            # Output the Q-value predictions
            target_f = self.model.predict(state)
            
            # Update the chosen action value with the new computed target
            target_f[0][action] = target
            
            # Train the new model based on the new target and current state
            # for next possible action
            self.model.fit(state, target_f, epochs=1, verbose=0)
        
        # Decay epsilon after each replay until it drops to the minimum
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    #Update the target model based on the current computed values
    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())
            
    #load a saved model
    def load(self, name):
        self.model.load_weights(name)
    
    #save parameters of trained models
    def save(self, name):
        self.model.save_weights(name)
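
To get a feel for the exploration schedule, the sketch below replays the decay rule from `replay` using the hyperparameters defined in `__init__` (start 1.0, minimum 0.1, decay 0.995) and counts how many `replay` calls pass before epsilon stops shrinking. The function name `decay_steps` is hypothetical.

```python
def decay_steps(epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.995):
    """Count multiplicative decays applied while epsilon is above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon

steps, final_epsilon = decay_steps()
print(steps, final_epsilon)
```

Because the check happens before each multiplication, the final value ends up slightly below `epsilon_min` rather than exactly on it. Since `replay` is called once per environment step, the floor is reached after a few hundred training steps.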

In [4]:
# Average the most recent frames (up to 4) into a single blended image
def join_images(images, state_size = (88, 80, 1)):
    new_dim_Arr = np.zeros((state_size), np.float64)
    avg_image = np.expand_dims(new_dim_Arr, axis=0)  # shape (1, 88, 80, 1)

    for img in images:
        avg_image += img
        
    # Divide by the number of frames collected so far (at most 4)
    if len(images) < 4:
        return avg_image / len(images)
    else:
        return avg_image / 4

In [5]:
# Function for preprocessing each frame, src: github.com/ageron/tiny-dqn
def process_frame(frame, state_size = (88, 80, 1)):
    pacman = np.array([210, 164, 74]).mean()  # Greyscale value of Ms. Pac-Man's colour
    img = frame[1:176:2, ::2]    # Crop and downsize
    img = img.mean(axis=2)       # Convert to greyscale
    img[ img==pacman ] = 0       # Improve contrast by zeroing Ms. Pac-Man's pixels
    # Normalize pixel values to roughly the range -1 to 1
    img = (img - 128) / 128
    new_dim_img = img.reshape(state_size)
    
    return np.expand_dims(new_dim_img, axis=0)  # Add a batch dimension

#### Let's define the environment and associated parameters

In [6]:
env = gym.make('MsPacman-v0')
state_size = (88, 80, 1)
action_size = env.action_space.n
agent = Pacman_Agent(state_size, action_size)

EPISODES = 100
batch_size = 8  # number of stored experiences sampled per training step

# number of initial steps skipped with action 0 before each episode begins
skips = 90  

total_time = 0   # keeping track of total number of steps taken
all_rewards = 0  # cumulative game score, used to compute the avg score per episode

# episode-finished flag, initially False
done = False

rewards_over_episodes= []
eps_over_episodes= []
score_over_episodes = []

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 22, 20, 32)        2080      
_________________________________________________________________
activation_1 (Activation)    (None, 22, 20, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 10, 64)        32832     
_________________________________________________________________
activation_2 (Activation)    (None, 11, 10, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 11, 10, 64)        36928     
_________________________________________________________________
activation_3 (Activation)    (None, 11, 10, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 7040)              0         
__________

#### Run episodes

Train the agent with DQN, preprocessing each frame before it is fed to the network.

In [None]:
for ep in range(EPISODES):
    total_reward = 0
    score= 0
    
    # Reset the environment for every episode, preprocess the first frame and
    # start a rolling buffer holding the last 4 frames
    state = process_frame(env.reset())
    images = deque(maxlen=4)  
    images.append(state)
    
    # skip the first steps of each episode by repeating action 0
    for _ in range(skips): 
        env.step(0)
    
    # run each episode for a certain time limit
    for time in range(1000):
        env.render()
        total_time += 1
        
        # Every update_rate timesteps we update the target network parameters
        if total_time % agent.update_rate == 0:
            agent.update_target_model()
        
        # Return the avg of the last 4 frames
        state = join_images(images)
        
        # Perform interaction of the agent with the environment 
        # to get the dynamics and steps to take
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        
        # Return the avg of the last 4 frames after processing and
        # combining the images
        next_state = process_frame(next_state)
        images.append(next_state)
        next_state = join_images(images)
        
        # Store replay memory
        agent.remember(state, action, reward, next_state, done)
        
        #Update the state
        state = next_state
        
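        # Add the raw game reward to the episode score
        score+= reward
        
        # Subtract a per-step penalty of 1 from the logged reward only;
        # the transition stored in memory above keeps the raw reward
        reward -= 1
        total_reward += reward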
        
        if done:
            all_rewards += score
            
            print("episode: {}/{}, game score: {}, reward: {}, avg reward: {}, time: {}, total time: {}"
                  .format(ep+1, EPISODES, score, total_reward, all_rewards/(ep+1), time, total_time))
            
            break
        
        # once the replay memory holds more than batch_size experiences,
        # train the agent on a random minibatch of them
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
        
    # save epsilon, reward and score values for each episode
    eps_over_episodes.append(agent.epsilon)
    rewards_over_episodes.append(total_reward)
    score_over_episodes.append(score)
    
    # Plot curve of reward vs episode after every 20 episodes
    if (ep+1) % 20 == 0:
        plt.figure()
        plt.title('Curve of reward vs episodes after completion of ' + str(ep+1) + ' episodes')
        plt.plot(rewards_over_episodes)
        plt.show()

# plot score, reward and epsilon curves at the end of the episodes
plt.figure()
plt.title('Final curve of score vs episodes')
plt.plot(score_over_episodes)
plt.show()

plt.figure()
plt.title('Final curve of reward vs episodes')
plt.plot(rewards_over_episodes)
plt.show()

plt.figure()
plt.title('Epsilon-Decay curve')
plt.plot(eps_over_episodes)
plt.show()
env.close()
