# **Using the fragile framework as a memory explorer to train a neural network in Atari games**

In this tutorial we explain how to use the **fragile framework** as an exploration runner that collects a high-quality replay memory, which is then used to train a neural network on games from the OpenAI Gym library. It covers how to set up the whole training process in any Python Jupyter notebook by simply running all the cells.

This code has been designed and tested using the Google Colab environment: https://colab.research.google.com/

Before continuing, you should read and understand the [getting started tutorial](https://github.com/FragileTech/fragile/blob/master/examples/01_getting_started.ipynb).

# **The main point**

The main point of using the **fragile framework** here is that it makes it possible to train a neural network model on any OpenAI Gym game without needing a huge random replay memory or a supplementary target network, both of which are commonly used in DQN (Deep Q-Network) reinforcement learning techniques.

With the fragile framework we can directly generate useful and "small" replay memory packs and feed them straight into the fit process of the model, in a supervised learning fashion.

**Note:**

It is very important to understand that we do not use the reward of each step. We use an imitation learning method in which the model tries to imitate what the best walker in the fragile framework swarm did along its history tree.
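
As a minimal sketch of this idea, the snippet below turns a list of (state, action) pairs into one-hot supervised targets, which is the kind of data the trainer later in this tutorial fits the network on. The `build_targets` helper and the toy trajectory are hypothetical names used only for illustration; they are not part of fragile.

```python
import numpy as np


def build_targets(trajectory, n_actions):
    """Return (states, one-hot actions) arrays from (state, action) pairs."""
    states = np.asarray([state for state, _ in trajectory], dtype=np.float64)
    targets = np.zeros((len(trajectory), n_actions), dtype=np.float64)
    for row, (_, action) in enumerate(trajectory):
        # Mark the action chosen by the "expert" (here: the best walker).
        targets[row, action] = 1.0
    return states, targets


# Toy trajectory: three 2-dimensional "states" with the expert's actions.
toy_trajectory = [([0.0, 1.0], 2), ([1.0, 0.0], 0), ([0.5, 0.5], 1)]
states, targets = build_targets(toy_trajectory, n_actions=3)
print(states.shape, targets.shape)
```

No reward appears anywhere in these targets: the network only learns which action was taken in each state.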

# **Results**

Using only a few training runs (and a very small replay memory), this algorithm is able to reach the average score that other RL methods such as DQN reach using millions of training steps and a very large replay memory.

The test was made using the game: **SpaceInvaders**

Human average: ~372

DDQN average: ~479 (128%)

Our average: ~500

In the game **Atlantis**, our code reaches the human average score in roughly 4 training runs: ~25000

**Note:**

There are still many hyperparameters to play with in order to improve these results ;).

**We first install all the requirements needed to run the code**

In [0]:
!pip install numpy > /dev/null 2>&1
!pip install gym > /dev/null 2>&1
!pip install keras > /dev/null 2>&1
!pip install matplotlib > /dev/null 2>&1
!pip install opencv-python > /dev/null 2>&1
!pip install tensorflow > /dev/null 2>&1
!pip install Pillow > /dev/null 2>&1
!pip install git+https://github.com/FragileTech/plangym.git > /dev/null 2>&1
!pip install fragile > /dev/null 2>&1
!pip install "fragile[all]" > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get install -y cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
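
Because the install commands above discard their output, a failed install goes unnoticed until a later import breaks. The following is a small sketch, not part of the original notebook, that lists which modules cannot be found; `missing_modules` is a hypothetical helper name, and the module list is an assumption based on the imports used in this tutorial.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names can differ from pip names (cv2 <- opencv-python, PIL <- Pillow).
required = ["numpy", "gym", "cv2", "tensorflow", "PIL",
            "plangym", "fragile", "pyvirtualdisplay"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module was found.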

**We declare some helper classes and methods**

In [0]:

"""
Source: https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
"""
import numpy as np
from collections import deque
import gym
from gym import spaces
import cv2

from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only

import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()


def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env_video(env):
  env = Monitor(env, './video', force=True)
  return env


class NoopResetEnv(gym.Wrapper):
    def __init__(self, env=None, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        super(NoopResetEnv, self).__init__(env)
        self.noop_max = noop_max
        self.override_num_noops = None
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def reset(self):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset()
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = np.random.randint(1, self.noop_max + 1)
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(0)
            if done:
                obs = self.env.reset()
        return obs


class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        """For environments where the user need to press FIRE for the game to start."""
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def reset(self):
        self.env.reset()
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset()
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset()
        return obs


class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env=None):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        super(EpisodicLifeEnv, self).__init__(env)
        self.lives = 0
        self.was_real_done = True
        self.was_real_reset = False

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset()
            self.was_real_reset = True
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
            self.was_real_reset = False
        self.lives = self.env.unwrapped.ale.lives()
        return obs


class ProcessFrame84(gym.ObservationWrapper):
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1))

    def observation(self, obs):
        return ProcessFrame84.process(obs)

    @staticmethod
    def process(frame):
        if frame.size == 210 * 160 * 3:
            img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
        elif frame.size == 250 * 160 * 3:
            img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
        else:
            assert False, "Unknown resolution."
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
        resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)


class ClippedRewardsWrapper(gym.RewardWrapper):
    def reward(self, reward):
        """Change all the positive rewards to 1, negative to -1 and keep zero."""
        return np.sign(reward)


class LazyFrames(object):
    def __init__(self, frames):
        """This object ensures that common frames between the observations are only stored once.
        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
        buffers.
        This object should only be converted to numpy array before being passed to the model.
        You'd not believe how complex the previous solution was."""
        self._frames = frames

    def __array__(self, dtype=None):
        out = np.concatenate(self._frames, axis=0)
        if dtype is not None:
            out = out.astype(dtype)
        return out


class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.
        Returns lazy array, which is much more memory efficient.
        See Also
        --------
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0]*k, shp[1], shp[2]))

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))


class ChannelsFirstImageShape(gym.ObservationWrapper):
    """
    Change image shape to CWH
    """
    def __init__(self, env):
        super(ChannelsFirstImageShape, self).__init__(env)
        old_shape = self.observation_space.shape
        # Frames are uint8 images in [0, 255], so the declared bounds should match.
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.uint8)

    def observation(self, observation):
        return np.swapaxes(observation, 2, 0)


class MainGymWrapper():

    @staticmethod
    def wrap(env):
        env = NoopResetEnv(env, noop_max=30)
        if 'FIRE' in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)
        env = ProcessFrame84(env)
        env = ChannelsFirstImageShape(env)
        env = FrameStack(env, 4)
        # env = ClippedRewardsWrapper(env)
        return env
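
To make the shapes produced by this wrapper chain concrete, here is a NumPy-only sketch that mimics what `ChannelsFirstImageShape`, `FrameStack` and `LazyFrames` do to a single preprocessed frame. It uses a dummy all-zero frame instead of a real environment, so it only illustrates the array shapes, not the image processing.

```python
from collections import deque

import numpy as np

K = 4  # number of stacked frames, as in FrameStack(env, 4)

# A fake 84x84x1 grayscale frame, shaped like the output of ProcessFrame84.
frame_hwc = np.zeros((84, 84, 1), dtype=np.uint8)

# ChannelsFirstImageShape swaps axes 2 and 0: (84, 84, 1) -> (1, 84, 84).
frame_chw = np.swapaxes(frame_hwc, 2, 0)

# FrameStack keeps the last K frames; on reset it repeats the first frame K times.
frames = deque([frame_chw] * K, maxlen=K)

# LazyFrames concatenates along axis 0, giving the network input shape.
stacked = np.concatenate(list(frames), axis=0)
print(frame_chw.shape, stacked.shape)  # (1, 84, 84) (4, 84, 84)
```

The final `(4, 84, 84)` shape is what the model receives as one observation.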

**This is the very simple deep CNN model that we will train using the small replay memory pack produced by the fragile framework**

In [0]:
import random

import numpy as np
# Use the public Keras API shipped with TensorFlow instead of the private
# tensorflow.python.keras package.
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop

BATCH_SIZE = 32


class ConvolutionalNeuralNetwork:

    def __init__(self, input_shape, action_space):
        self.model = Sequential()
        self.model.add(Conv2D(32,
                              8,
                              strides=(4, 4),
                              padding="valid",
                              activation="relu",
                              input_shape=input_shape,
                              data_format="channels_first"))
        # Only the first layer needs input_shape; later layers infer it.
        self.model.add(Conv2D(64,
                              4,
                              strides=(2, 2),
                              padding="valid",
                              activation="relu",
                              data_format="channels_first"))
        self.model.add(Conv2D(64,
                              3,
                              strides=(1, 1),
                              padding="valid",
                              activation="relu",
                              data_format="channels_first"))
        self.model.add(Flatten())
        self.model.add(Dense(512, activation="relu"))
        self.model.add(Dense(action_space))
        # `learning_rate` replaces the deprecated `lr` argument.
        self.model.compile(loss="mean_squared_error",
                           optimizer=RMSprop(learning_rate=0.00025,
                                             rho=0.95,
                                             epsilon=0.01),
                           metrics=["accuracy"])
        self.model.summary()


class ModelTrainer():

    def __init__(self, game_name, input_shape, action_space):
        self.action_space = action_space
        self.model = ConvolutionalNeuralNetwork(input_shape, action_space).model
        self.memory = []

    def move(self, state):
        actions = self.model.predict(np.expand_dims(np.asarray(state).astype(np.float64), axis=0), batch_size=1)
        return np.argmax(actions[0])

    def remember(self, memory):
        self.memory = memory

    def step_update(self, total_step):
        self._train()

    def _train(self):
        # random.sample raises ValueError when the memory holds fewer than
        # BATCH_SIZE entries, so check the size before sampling.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)

        current_states = []
        values = []

        for entry in batch:
            # Convert the stored LazyFrames into a (4, 84, 84) float array.
            current_states.append(np.asarray(entry["current_state"]).astype(np.float64))
            # One-hot target: 1 for the action taken by the best walker.
            q = np.zeros(self.action_space)
            q[entry["action"]] = 1
            values.append(q)

        self.model.fit(np.asarray(current_states),
                       np.asarray(values),
                       epochs=500,
                       batch_size=BATCH_SIZE,
                       verbose=1)
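
As a quick check of the architecture above, the following sketch computes the spatial size after each convolution using the standard formula for "valid" padding, `(size - kernel) // stride + 1`. The `conv_output_size` helper is a hypothetical name introduced only for this calculation.

```python
def conv_output_size(size, kernel, stride):
    """Output length along one axis of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


size = 84  # FRAME_SIZE
sizes = []
# (kernel, stride) of the three Conv2D layers in the model.
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

flattened = 64 * size * size  # 64 filters in the last convolution
print(sizes, flattened)  # [20, 9, 7] 3136
```

So the `Flatten` layer hands 3136 features to the 512-unit dense layer.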


**Finally, here is the main code, where we explore the game environment with the fragile framework to generate a small replay memory and then fit the neural network model on that data**

In [0]:
import gym
import argparse
import numpy as np
import atari_py
from IPython.display import clear_output
import time
from plangym import AtariEnvironment, ParallelEnvironment
from fragile.atari.env import AtariEnv

from fragile.core import DiscreteUniform, GaussianDt
from fragile.core.tree import HistoryTree
from fragile.core.swarm import Swarm

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

FRAMES_IN_OBSERVATION = 4
FRAME_SIZE = 84
INPUT_SHAPE = (FRAMES_IN_OBSERVATION, FRAME_SIZE, FRAME_SIZE)
MEMORY_SIZE = 900000
EXPLORE_MEMORY_STEPS = 5


class FragileRunner:
    def __init__(self, game_name):

        self.env = ParallelEnvironment(
            env_class=AtariEnvironment,
            name=game_name,
            clone_seeds=True,
            autoreset=True,
            blocking=False,
        )

        self.game_name = game_name
        self.env_callable = lambda: AtariEnv(env=self.env)
        self.dt = GaussianDt(min_dt=3, max_dt=1000, loc_dt=4, scale_dt=2)
        self.model_callable = lambda env: DiscreteUniform(
            env=self.env, critic=self.dt)
        self.prune_tree = True
        # A bigger number will increase the quality of the trajectories sampled.
        self.n_walkers = 16
        self.max_epochs = 1000  # Increase to sample longer games.
        self.reward_scale = 2  # Rewards are more important than diversity.
        self.distance_scale = 1
        self.minimize = False  # We want to get the maximum score possible.
        self.memory = []

    def run(self):

        swarm = Swarm(
            model=self.model_callable,
            env=self.env_callable,
            tree=HistoryTree,
            n_walkers=self.n_walkers,
            max_epochs=self.max_epochs,
            prune_tree=self.prune_tree,
            reward_scale=self.reward_scale,
            distance_scale=self.distance_scale,
            minimize=self.minimize,
        )

        env_name = self.game_name
        env = MainGymWrapper.wrap(gym.make(env_name))
        
        print("Creating fractal replay memory...")

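        for i in range(EXPLORE_MEMORY_STEPS):
            try:
                _ = swarm.run(report_interval=1000)

                print("Max. fractal cum_rewards:", max(swarm.walkers.states.cum_rewards))

                # Recover the tree branch of the walker with the highest cumulative reward.
                best_ix = swarm.walkers.states.cum_rewards.argmax()
                best_id = swarm.walkers.states.id_walkers[best_ix]
                path = swarm.tree.get_branch(best_id, from_hash=True)

                # Replay the branch actions (path[1]) in the wrapped env and
                # store the (state, action) pairs.
                current_state = env.reset()
                for a in path[1]:
                    next_state, reward, terminal, _ = env.step(a)

                    self.memory.append({"current_state": current_state, "action": a})

                    current_state = next_state

                    # Drop the oldest entry once the memory is full.
                    if len(self.memory) > MEMORY_SIZE:
                        self.memory.pop(0)

                    # Stop replaying once the episode has ended.
                    if terminal:
                        break
            except Exception as exc:
                # Report the failure instead of silently ignoring it, then keep exploring.
                print("Exploration step failed:", exc)

            print("Fractal replay memory size: ", len(self.memory))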

        return self.memory


class FractalExplorationImitationLearning:

    def __init__(self):
        # We choose a game
        game_name = "SpaceInvaders"

        # Choose after how many runs we should stop
        total_run_limit = 100
        print("Selected game: " + str(game_name))        
        print("Total run limit: " + str(total_run_limit))
        
        env_name = game_name + "Deterministic-v4"
        env = wrap_env_video(MainGymWrapper.wrap(gym.make(env_name)))
        explorer = FragileRunner(env_name)
        
        # Game model
        game_model = ModelTrainer(env_name, INPUT_SHAPE, env.action_space.n)

        # model training
        self._main_loop(env_name, explorer, game_model, total_run_limit)

    def _main_loop(self, env_name, explorer, game_model, total_run_limit):
        run = 0
        while run < total_run_limit:
            run += 1            
            print("Training run:", run)                         

            # Explore the game state space using the fragile framework
            game_model.remember(explorer.run())

            # Training a run                       
            game_model.step_update(run)
            
            # Testing model
            clear_output()
            print("Testing Neural Network...")
            env = wrap_env_video(MainGymWrapper.wrap(gym.make(env_name)))
            terminal = False
            current_state = env.reset()
            score = 0
            while not terminal:                     
                action = game_model.move(current_state)
                next_state, reward, terminal, _ = env.step(action)
                score += reward
                current_state = next_state                
            env.close()
            
            print("Neural Network score:", score)
            show_video()   

if __name__ == "__main__":
    FractalExplorationImitationLearning()
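
The runner above trims its memory with `list.pop(0)`, which shifts every remaining element each time. As an alternative sketch, a `collections.deque` with `maxlen` drops the oldest entry automatically. `ReplayMemory` and its methods are hypothetical names, not part of fragile or of the code above, and the integer "states" are placeholders for real observations.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of {"current_state", "action"} entries."""

    def __init__(self, capacity):
        self._entries = deque(maxlen=capacity)

    def append(self, current_state, action):
        # When the deque is full, the oldest entry is discarded automatically.
        self._entries.append({"current_state": current_state, "action": action})

    def sample(self, batch_size):
        """Return batch_size random entries, or [] if not enough are stored."""
        if len(self._entries) < batch_size:
            return []
        return random.sample(list(self._entries), batch_size)

    def __len__(self):
        return len(self._entries)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.append(current_state=step, action=step % 2)
print(len(memory))  # 3: only the three newest entries remain
```

Returning an empty list when too few entries are stored lets the caller skip a training step instead of handling an exception.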