From gym/README.rst 
==========

* There are two basic concepts in reinforcement learning: the environment (namely, the outside world) and the agent (namely, the algorithm you are writing). The agent sends actions to the environment, and the environment replies with observations and rewards (that is, a score).

* The core gym interface is Env, which is the unified environment interface. There is no interface for agents; that part is left to you. The following are the Env methods you should know:

  * reset(self): Reset the environment's state. Returns observation.
  * step(self, action): Step the environment by one timestep. Returns observation, reward, done, info.
  * render(self, mode='human', close=False): Render one frame of the environment. The default mode will do something human friendly, such as pop up a window. Passing the close flag signals the renderer to close any such windows.
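
The `reset`/`step` contract above can be sketched as a minimal agent-environment loop. This is only an illustration: `ToyEnv`, `run_episode`, and `random_policy` are hypothetical names and not part of gym; the toy environment just mimics the method signatures listed above so the loop can run without gym installed.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment with the reset/step methods described above."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        # Reset the environment's state and return the first observation.
        self.t = 0
        return self.t

    def step(self, action):
        # Advance one timestep and return (observation, reward, done, info).
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return self.t, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


def random_policy(observation):
    # Stand-in for an agent: ignores the observation and picks action 0 or 1.
    return random.randint(0, 1)


print(run_episode(ToyEnv(), random_policy))
```

Since gym defines no agent interface, the "agent" here is just a callable from observation to action; a real agent would also learn from the rewards it receives.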

From ALE manual:
========

* The action space consists of both Player A and Player B’s actions (see Section 7.1 for details). In general, Player B’s action may safely be set to noop (18) but it should be left out altogether if the restricted_action_set option is set to true.

* Available Actions: The following regular actions are defined in common/Constants.h and interpreted by ALE:
  * noop (0)  fire (1)  up (2)  right (3)  left (4) 
  * down (5)  up-right (6)  up-left (7)  down-right (8)  down-left (9)
  * up-fire (10)  right-fire (11)  left-fire (12)  down-fire (13)  up-right-fire (14)
  * up-left-fire (15)  down-right-fire (16)  down-left-fire (17)  reset* (40)

* Note that the reset (40) action toggles the Atari 2600 reset switch rather than resetting the environment, and as such is ignored by most interfaces. The shared library, CTypes, and FIFO interfaces provide methods for properly resetting the environment.
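
The action table can be captured as a small lookup. The manual also defines Player B's actions as 18 plus the equivalent Player A value, which the sketch below encodes. `ALE_ACTIONS`, `PLAYER_B_OFFSET`, and `player_b_action` are hypothetical names chosen for illustration, not ALE identifiers.

```python
# Regular Player A actions, copied from the table above (reset (40) is omitted).
ALE_ACTIONS = {
    0: 'noop', 1: 'fire', 2: 'up', 3: 'right', 4: 'left',
    5: 'down', 6: 'up-right', 7: 'up-left', 8: 'down-right', 9: 'down-left',
    10: 'up-fire', 11: 'right-fire', 12: 'left-fire', 13: 'down-fire',
    14: 'up-right-fire', 15: 'up-left-fire', 16: 'down-right-fire',
    17: 'down-left-fire',
}

# Per the manual, Player B's action is 18 + the equivalent Player A action.
PLAYER_B_OFFSET = 18


def player_b_action(player_a_action):
    """Return the Player B action value equivalent to a Player A action."""
    if player_a_action not in ALE_ACTIONS:
        raise ValueError('unknown Player A action: {}'.format(player_a_action))
    return player_a_action + PLAYER_B_OFFSET


for value in (0, 2):
    print('{} ({}) -> Player B {}'.format(
        ALE_ACTIONS[value], value, player_b_action(value)))
```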

* Player B’s actions are defined to be 18 + the equivalent action value for Player A. For example, Player B’s up action is up (20). In addition to the regular ALE actions, the FIFO interfaces also process some (somewhat deprecated) actions, which are listed in the ALE manual.

* The observation space depends on whether the send_rgb option is enabled. When enabled, the observation space consists of 100,928 integers: first the 128 bytes of RAM (taking values in 0–255), followed by 100,800 bytes describing the screen. Each pixel is described by three bytes, taking values in 0–255, specifying the pixel’s red, green and blue components in that order. The screen is provided in row-order, beginning with the 160 pixels that compose the first row. If send_rgb is disabled (this is the default), the observation space consists of 33,728 integers: first the 128 bytes of RAM, then the 33,600 screen pixels (in NTSC format; values in 0–127). These pixels are also provided in row order.
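
The layout described above (128 RAM bytes followed by the screen in row order) can be unpacked with NumPy. This is a sketch under stated assumptions: `split_observation` and the constants are hypothetical names, and the screen height of 210 rows is derived from 33,600 pixels / 160 pixels per row rather than quoted from the manual.

```python
import numpy as np

RAM_SIZE = 128
SCREEN_WIDTH = 160
SCREEN_HEIGHT = 210  # derived: 33,600 pixels / 160 pixels per row


def split_observation(flat, send_rgb):
    """Split a flat observation into (ram, screen) arrays.

    With send_rgb the screen has shape (210, 160, 3); otherwise (210, 160).
    """
    flat = np.asarray(flat)
    channels = 3 if send_rgb else 1
    expected = RAM_SIZE + SCREEN_HEIGHT * SCREEN_WIDTH * channels
    if flat.shape != (expected,):
        raise ValueError('expected {} values, got shape {}'.format(expected, flat.shape))
    ram = flat[:RAM_SIZE]
    pixels = flat[RAM_SIZE:]
    if send_rgb:
        # Row order, three bytes (R, G, B) per pixel.
        screen = pixels.reshape(SCREEN_HEIGHT, SCREEN_WIDTH, 3)
    else:
        # Row order, one NTSC palette value per pixel.
        screen = pixels.reshape(SCREEN_HEIGHT, SCREEN_WIDTH)
    return ram, screen


ram, screen = split_observation(np.zeros(100928, dtype=np.int64), send_rgb=True)
print(ram.shape, screen.shape)
```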



In [None]:
import numpy as np
from PIL import Image
import cProfile
import pstats

from dqn import AtariDQNAgent

In [None]:
agent = AtariDQNAgent()

In [None]:
agent.train(10000)

In [None]:
cProfile.run('agent.train(1000)', 'cprofile.out')

In [None]:
p = pstats.Stats('cprofile.out')
p.sort_stats('tottime').print_stats(20)

In [None]:
agent._replay_memory.index

In [None]:
import random
e = agent._replay_memory[random.randint(0, len(agent._replay_memory) - 1)]

In [None]:
e[1], e[2], e[4]

In [None]:
s = np.array(e[0][:,:,0], dtype=np.uint8)

In [None]:
Image.fromarray(s)

In [None]:
agent.train(100, save_path='checkpoints/Breakout-v0-50')

In [None]:
agent._replay_memory

In [None]:
import gym

In [None]:
breakout = gym.make('Breakout-v0')

In [None]:
#breakout.env.__init__(game='breakout', obs_type='image', frameskip=1, repeat_action_probability=0)

In [None]:
observation = breakout.reset()
state = None

In [None]:
Image.fromarray(observation)

In [None]:
observation.dtype

In [None]:
observation.shape

In [None]:
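# Split the RGB channels and combine them into a luminance (grayscale) image.
R = observation[:, :, 0]
G = observation[:, :, 1]
B = observation[:, :, 2]
# Float weights avoid integer division (299/1000 == 0 under Python 2).
phi = R * 0.299 + G * 0.587 + B * 0.114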

In [None]:
phi.shape

In [None]:
Image.fromarray(np.array(phi, dtype=np.uint8))

In [None]:
phi = np.average(observation, weights=[0.299, 0.587, 0.114], axis=2)
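
The explicit per-channel sum and the `np.average` call above should produce the same luminance image. A quick check on synthetic data, assuming a 210x160 RGB `uint8` frame; `frame`, `luminance_explicit`, and `luminance_average` are hypothetical names used only for this sketch.

```python
import numpy as np

WEIGHTS = [0.299, 0.587, 0.114]  # same luminance weights as in the cells above


def luminance_explicit(rgb):
    """Weighted sum of the R, G and B channels, written out per channel."""
    r, g, b = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
    return r * WEIGHTS[0] + g * WEIGHTS[1] + b * WEIGHTS[2]


def luminance_average(rgb):
    """Same quantity via np.average over the channel axis."""
    return np.average(rgb, weights=WEIGHTS, axis=2)


# Synthetic stand-in for an observation returned by the environment.
rng = np.random.RandomState(0)
frame = rng.randint(0, 256, size=(210, 160, 3)).astype(np.uint8)

print(np.allclose(luminance_explicit(frame), luminance_average(frame)))
```

`np.average` divides by the sum of the weights, so the two forms agree only because these weights sum to 1.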

In [None]:
state = agent.preprocess_observation(observation, state)

In [None]:
Image.fromarray(np.array(state[:, :, 0], dtype=np.uint8))

In [None]:
observation, reward, done, info = breakout.step(3)

In [None]:
breakout.action_space

In [None]:
atari_action = {
    0: 'noop',
    1: 'fire',
    2: 'up',
    3: 'right',
    4: 'left',
    5: 'down',
    11: 'right-fire',
    12: 'left-fire',
}

In [None]:
# Map each gym action index to the name of the underlying ALE action.
for i, action_value in enumerate(breakout.env._action_set):
    print('{}: {}'.format(i, atari_action[action_value]))

In [None]:
observation, reward, done, info = breakout.step(0)
Image.fromarray(observation)

In [None]:
info, done

In [None]:
import random
steps = 0
breakout.reset()
observation, reward, done, info = breakout.step(0)
lives = info['ale.lives']
while not done and steps < 400:
    observation, reward, done, info = breakout.step(random.randint(1, 2))
    steps += 1

In [None]:
Image.fromarray(observation)

In [None]:
breakout.render()

In [None]:
_ = breakout.step(0)

In [None]:
reward, done, info

In [None]:
print('{:g}'.format(2562692632018944.0))

In [None]:
arr = np.random.rand(3, 4)

In [None]:
arr

In [None]:
np.amax(arr, axis=1)

In [None]:
mask = np.array([1, 0, 1])

In [None]:
mask

In [None]:
arr * (1 - mask[:, np.newaxis])

In [None]:
mask.dot(arr)
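
The row-wise maximum and the `(1 - mask)` broadcast in the cells above look like the building blocks of a one-step Q-learning target, where terminal transitions contribute no bootstrapped value. A sketch under that assumption; it is not taken from `dqn.py`, and `q_learning_targets`, `rewards`, `q_next`, `done`, and `gamma` are hypothetical names.

```python
import numpy as np


def q_learning_targets(rewards, q_next, done, gamma=0.99):
    """Compute r + gamma * max_a Q(s', a), with the bootstrap term zeroed when done."""
    rewards = np.asarray(rewards, dtype=np.float64)
    done = np.asarray(done, dtype=np.float64)
    # Best next-state value per transition (one row per transition).
    best_next = np.amax(np.asarray(q_next), axis=1)
    # (1 - done) removes the bootstrap term for terminal transitions.
    return rewards + gamma * best_next * (1.0 - done)


rewards = np.array([1.0, 0.0, 2.0])
q_next = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
done = np.array([1, 0, 1])
print(q_learning_targets(rewards, q_next, done, gamma=0.5))
```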

In [None]:
mask

In [None]:
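# Pass the mask by keyword (the second positional argument is dtype) and expand it to arr's shape.
np.ma.array(arr, mask=np.repeat(mask[:, np.newaxis].astype(bool), arr.shape[1], axis=1))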