## OpenAI Gym

### agent anatomy

In [4]:
import random


class Environment:
    def __init__(self):
        self.steps_left = 10

    def get_observation(self):
        return [0.0, 0.0, 0.0]

    def get_actions(self):
        return [0, 1]

    def is_done(self):
        return self.steps_left == 0

    def action(self, action):
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()


class Agent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env):
        current_obs = env.get_observation()
        actions = env.get_actions()
        reward = env.action(random.choice(actions))
        self.total_reward += reward


if __name__ == "__main__":
    env = Environment()
    agent = Agent()

    while not env.is_done():
        agent.step(env)

    print("Total reward got: %.4f" % agent.total_reward)


Total reward got: 3.5296


### cartpole_random

In [6]:
import gym


if __name__ == "__main__":
    env = gym.make("CartPole-v0")

    total_reward = 0.0
    total_steps = 0
    obs = env.reset() #reset the env and obtain the first observation

    while True:
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if done:
            break

    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

Episode done in 42 steps, total reward 42.00




### random_actionwrapper

In [23]:
import gym
import random

class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env, epsilon=0.1):
        super(RandomActionWrapper, self).__init__(env)
        self.epsilon = epsilon #random ratio

    def action(self, action): #action wrapper
        if random.random() < self.epsilon:
            print("Random!")
            return self.env.action_space.sample()
        return action
    
    # def observation(self,obs):
    # def reward(self,rew):

if __name__ == "__main__":
    env = RandomActionWrapper(gym.make("CartPole-v0"))

    obs = env.reset()
    total_reward = 0.0

    while True:
        obs, reward, done, _ = env.step(0) #same action 0
        total_reward += reward
        if done:
            break

    print("Reward got: %.2f" % total_reward)

Random!
Random!
Reward got: 10.00


### cartpole_random_monitor

If you encounter the error `AttributeError: 'ImageData' object has no attribute 'data'`, installing an older pyglet and ffmpeg usually resolves it:

```bash
pip install pyglet==1.3.2
sudo apt-get install ffmpeg
```

In [14]:
import gym

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    env = gym.wrappers.Monitor(env, "recording", force=True)  # videos are written to ./recording/

    total_reward = 0.0
    total_steps = 0
    obs = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if done:
            break

    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
    env.close()
    env.env.close()

Episode done in 12 steps, total reward 12.00


## Deep Learning with PyTorch

### modules

[nn.module](https://pytorch.org/docs/stable/nn.html#module)

All classes in the `torch.nn` package inherit from the `nn.Module` base class.

Some useful methods:

- `parameters()`: returns an iterator over all variables that require gradient computation (that is, the module weights)
- `zero_grad()`: resets the gradients of all parameters
- `to(device)`: moves all module parameters to a given device (CPU or GPU)
- `state_dict()`: returns a dictionary with all module parameters, which is useful for model serialization
- `load_state_dict()`: initializes the module from a state dictionary
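
As a minimal sketch of these methods in use, the snippet below builds a tiny model and calls each of them. The model `tiny`, its layer sizes, and the helper names are assumptions made only for illustration; they do not appear elsewhere in these notes.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, used only to exercise the nn.Module methods listed above.
tiny = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))

# parameters(): iterate over all trainable tensors (weights and biases).
n_params = sum(p.numel() for p in tiny.parameters())
print("trainable values:", n_params)  # 2*3 + 3 + 3*1 + 1 = 13

# Run a backward pass so that gradients exist, then reset them.
out = tiny(torch.ones(4, 2)).sum()
out.backward()
tiny.zero_grad()
# Depending on the PyTorch version, gradients are either zeroed or set to None.
grads_cleared = all(p.grad is None or float(p.grad.abs().sum()) == 0.0
                    for p in tiny.parameters())

# to(device): move the parameters to a device; "cpu" is always available.
tiny = tiny.to(torch.device("cpu"))

# state_dict() / load_state_dict(): copy the weights into a second model.
clone = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))
clone.load_state_dict(tiny.state_dict())
x = torch.ones(1, 2)
same_output = torch.equal(tiny(x), clone(x))
print("gradients cleared:", grads_cleared, "| clone matches:", same_output)
```

The same `state_dict()` dictionary is what one would typically pass to `torch.save` when serializing a model.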

In [21]:
import torch
import torch.nn as nn

class OurModule(nn.Module):
    def __init__(self, num_inputs, num_classes, dropout_prob=0.3):
        super(OurModule, self).__init__()
        self.pipe = nn.Sequential(
            nn.Linear(num_inputs, 5),
            #torch.nn.Linear(in_features, out_features, bias=True)
            nn.ReLU(),
            nn.Linear(5, 20),
            nn.ReLU(),
            nn.Linear(20, num_classes),
            nn.Dropout(p=dropout_prob),
            nn.Softmax(dim=1)
            #along dim=1,since dim=0 is batch samples
        )

    def forward(self, x):
        return self.pipe(x)

if __name__ == "__main__":
    net = OurModule(num_inputs=2, num_classes=3)
    print(net)
    v = torch.FloatTensor([[2, 3],[3,4],[4,5]])#dim=0: batch, dim=1,input
    out = net(v)
    print(out)
    print("Cuda's availability is %s" % torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Data from cuda: %s" % out.to('cuda'))

OurModule(
  (pipe): Sequential(
    (0): Linear(in_features=2, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
    (5): Dropout(p=0.3)
    (6): Softmax()
  )
)
tensor([[0.4174, 0.4174, 0.1652],
        [0.2642, 0.6622, 0.0735],
        [0.2278, 0.7236, 0.0486]], grad_fn=<SoftmaxBackward>)
Cuda's availability is False


### tensorboard

In [22]:
import math
from tensorboardX import SummaryWriter


if __name__ == "__main__":
    writer = SummaryWriter()

    funcs = {"sin": math.sin, "cos": math.cos, "tan": math.tan}

    for angle in range(-360, 360):
        angle_rad = angle * math.pi / 180
        for name, fun in funcs.items():
            val = fun(angle_rad)
            writer.add_scalar(name, val, angle)

    writer.close()

Then run:
```bash
tensorboard --logdir runs --host localhost
```

### atari_gan

In [28]:
#!/usr/bin/env python
import random
import argparse
import cv2

import torch
import torch.nn as nn
import torch.optim as optim
from tensorboardX import SummaryWriter

import torchvision.utils as vutils

import gym
import gym.spaces

import numpy as np

log = gym.logger
log.set_level(gym.logger.INFO)

LATENT_VECTOR_SIZE = 100
DISCR_FILTERS = 64
GENER_FILTERS = 64
BATCH_SIZE = 16

# dimension input image will be rescaled
IMAGE_SIZE = 64

LEARNING_RATE = 0.0001
REPORT_EVERY_ITER = 100
SAVE_IMAGE_EVERY_ITER = 1000


class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of input numpy array:
    1. resize image into predefined size
    2. move color channel axis to a first place
    """
    def __init__(self, *args):
        super(InputWrapper, self).__init__(*args)
        assert isinstance(self.observation_space, gym.spaces.Box)
        old_space = self.observation_space
        self.observation_space = gym.spaces.Box(self.observation(old_space.low), self.observation(old_space.high),
                                                dtype=np.float32)

    def observation(self, observation):
        #1. resize image: (210, 160, 3) -> (64, 64, 3)
        new_obs = cv2.resize(observation, (IMAGE_SIZE, IMAGE_SIZE))
        #2. move the channel axis first: (64, 64, 3) -> (3, 64, 64), i.e. (channels, height, width)
        new_obs = np.moveaxis(new_obs, 2, 0)
        return new_obs.astype(np.float32)


class Discriminator(nn.Module):
    def __init__(self, input_shape):
        super(Discriminator, self).__init__()
        # this pipe converges image into the single number
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(in_channels=input_shape[0], out_channels=DISCR_FILTERS,
                      kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS, out_channels=DISCR_FILTERS*2,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*2),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 2, out_channels=DISCR_FILTERS * 4,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 4),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 4, out_channels=DISCR_FILTERS * 8,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 8),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 8, out_channels=1,
                      kernel_size=4, stride=1, padding=0),
            #real/fake probability
            nn.Sigmoid()
        )

    def forward(self, x):
        conv_out = self.conv_pipe(x)
        return conv_out.view(-1, 1).squeeze(dim=1)


class Generator(nn.Module):
    def __init__(self, output_shape):
        super(Generator, self).__init__()
        # pipe deconvolves input vector into (3, 64, 64) image
        self.pipe = nn.Sequential(
            #input: latent vector
            #transposed convolution (often called deconvolution)
            nn.ConvTranspose2d(in_channels=LATENT_VECTOR_SIZE, out_channels=GENER_FILTERS * 8,
                               kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(GENER_FILTERS * 8),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 8, out_channels=GENER_FILTERS * 4,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 4),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 4, out_channels=GENER_FILTERS * 2,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 2),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 2, out_channels=GENER_FILTERS,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS, out_channels=output_shape[0],
                               kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        return self.pipe(x)


# infinite generator: yields batches of observations sampled from randomly chosen environments
def iterate_batches(envs, batch_size=BATCH_SIZE):
    batch = [e.reset() for e in envs]
    env_gen = iter(lambda: random.choice(envs), None)

    while True: 
        e = next(env_gen)
        obs, reward, is_done, _ = e.step(e.action_space.sample()) #play by random agent
        if np.mean(obs) > 0.01:
            batch.append(obs)
        if len(batch) == batch_size:
            # Normalising input between -1 to 1
            batch_np = np.array(batch, dtype=np.float32) * 2.0 / 255.0 - 1.0
            yield torch.tensor(batch_np,dtype=torch.float32)
            batch.clear()
        if is_done:
            e.reset()


if __name__ == "__main__":
    #parser = argparse.ArgumentParser()
    #parser.add_argument("--cuda", default=False, action='store_true', help="Enable cuda computation")
    #args = parser.parse_args()

    #device = torch.device("cuda" if args.cuda else "cpu")
    device=torch.device("cpu")
    envs = [InputWrapper(gym.make(name)) for name in ('Breakout-v0', 'AirRaid-v0', 'Pong-v0')]
    input_shape = envs[0].observation_space.shape
    
    #2 nets
    net_discr = Discriminator(input_shape=input_shape).to(device)
    net_gener = Generator(output_shape=input_shape).to(device)

    objective = nn.BCELoss()
    gen_optimizer = optim.Adam(params=net_gener.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999))
    dis_optimizer = optim.Adam(params=net_discr.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999))
    writer = SummaryWriter()

    gen_losses = []
    dis_losses = []
    iter_no = 0

    true_labels_v = torch.ones(BATCH_SIZE, dtype=torch.float32, device=device)
    fake_labels_v = torch.zeros(BATCH_SIZE, dtype=torch.float32, device=device)

    for batch_v in iterate_batches(envs):
        # generate extra fake samples, input is 4D: batch, filters, x, y
        gen_input_v = torch.FloatTensor(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1).normal_(0, 1).to(device)
        batch_v = batch_v.to(device)
        gen_output_v = net_gener(gen_input_v)

        # train discriminator
        dis_optimizer.zero_grad()
        dis_output_true_v = net_discr(batch_v) #true data samples
        dis_output_fake_v = net_discr(gen_output_v.detach()) #generated samples; detach() stops gradients from flowing into the generator
        dis_loss = objective(dis_output_true_v, true_labels_v) + objective(dis_output_fake_v, fake_labels_v)
        dis_loss.backward()
        dis_optimizer.step()
        dis_losses.append(dis_loss.item())

        # train generator
        gen_optimizer.zero_grad()
        dis_output_v = net_discr(gen_output_v)
        gen_loss_v = objective(dis_output_v, true_labels_v)
        gen_loss_v.backward()
        gen_optimizer.step()
        gen_losses.append(gen_loss_v.item())

        #for tensorboard: tensorboard --logdir runs --host localhost
        iter_no += 1
        if iter_no % REPORT_EVERY_ITER == 0:
            log.info("Iter %d: gen_loss=%.3e, dis_loss=%.3e", iter_no, np.mean(gen_losses), np.mean(dis_losses))
            writer.add_scalar("gen_loss", np.mean(gen_losses), iter_no)
            writer.add_scalar("dis_loss", np.mean(dis_losses), iter_no)
            gen_losses = []
            dis_losses = []
        if iter_no % SAVE_IMAGE_EVERY_ITER == 0:
            writer.add_image("fake", vutils.make_grid(gen_output_v.data[:64], normalize=True), iter_no)
            writer.add_image("real", vutils.make_grid(batch_v.data[:64], normalize=True), iter_no)

INFO: Making new env: Breakout-v0
INFO: Making new env: AirRaid-v0
INFO: Making new env: Pong-v0
INFO: Iter 100: gen_loss=4.538e+00, dis_loss=1.008e-01
INFO: Iter 200: gen_loss=6.244e+00, dis_loss=6.144e-03
INFO: Iter 300: gen_loss=6.966e+00, dis_loss=2.346e-03
INFO: Iter 400: gen_loss=7.283e+00, dis_loss=2.378e-02
INFO: Iter 500: gen_loss=6.235e+00, dis_loss=6.072e-02
INFO: Iter 600: gen_loss=5.509e+00, dis_loss=2.121e-01
INFO: Iter 700: gen_loss=5.555e+00, dis_loss=1.789e-02
INFO: Iter 800: gen_loss=6.426e+00, dis_loss=6.950e-03
INFO: Iter 900: gen_loss=6.289e+00, dis_loss=1.374e-01
INFO: Iter 1000: gen_loss=5.316e+00, dis_loss=2.340e-02


RuntimeError: cuda runtime error (30) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:74

```bash
tensorboard --logdir runs --host localhost
```

## The Cross-Entropy Method

1. sample $A_1,\cdots,A_N$ from $p(A)$
2. evaluate $J(A_1),\cdots,J(A_N)$
3. pick the **elites** $A_{i_1},\cdots,A_{i_M}$ with the highest value, where $M<N$
4. refit $p(A)$ to the elites $A_{i_1},\cdots,A_{i_M}$
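
The four steps above can be sketched on a toy problem. This is only an illustration under stated assumptions: $p(A)$ is taken to be a one-dimensional Gaussian, and `cross_entropy_method`, `toy_objective`, and all default parameter values are hypothetical names and numbers chosen for the example, not part of the scripts below.

```python
import random
import statistics


def cross_entropy_method(objective, mean=0.0, std=5.0, n_samples=50,
                         n_elites=10, n_iters=30, seed=0):
    """Maximise `objective` over a scalar by refitting a Gaussian to the elites."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        # 1. sample A_1..A_N from p(A)
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # 2-3. evaluate J and keep the M samples with the highest value
        elites = sorted(samples, key=objective, reverse=True)[:n_elites]
        # 4. refit p(A) to the elites
        mean = statistics.mean(elites)
        std = statistics.pstdev(elites) + 1e-6  # keep the spread strictly positive
    return mean


def toy_objective(a):
    # Hypothetical objective with a single maximum at a = 3.
    return -(a - 3.0) ** 2


best = cross_entropy_method(toy_objective)
print("estimated maximiser: %.3f" % best)
```

In the CartPole script that follows, the sampled object is a whole episode, $J$ is the episode reward, and "refitting" means training the network on the elite episodes' observation/action pairs.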

### cartpole

In [47]:
import gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128 # arbitrary choice
BATCH_SIZE = 16
PERCENTILE = 70 # keep only the top 30% of episodes


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)# raw action scores (logits); softmax is applied outside the network
        )

    def forward(self, x):
        return self.net(x)

#helper
Episode = namedtuple('Episode', field_names=['reward', 'steps']) #for a single episode
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])#for a single step


def iterate_batches(env, net, batch_size):
    batch = [] # to accumulate batch
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1) #softmax layer
    while True:
        obs_v = torch.FloatTensor([obs])# (4) -> (1,4), add batch axis
        act_probs_v = sm(net(obs_v)) #raw action scores are fed to softmax function
        act_probs = act_probs_v.data.numpy()[0] #drop gradient tracking and the batch axis
        action = np.random.choice(len(act_probs), p=act_probs) #sample an action index with probabilities act_probs
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward #total reward of this episode
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))#an episode is over
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:#number of episode == batch_size
                yield batch
                batch = []
        obs = next_obs #current obs


def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward,  batch))# get every episode's reward
    reward_bound = np.percentile(rewards, percentile) #boundary reward, to filter elite episode
    reward_mean = float(np.mean(rewards))

    train_obs = [] # observations from all elite episodes, flattened into one list
    train_act = [] # matching actions, in the same order
    for example in batch:
        if example.reward < reward_bound:
            continue # filter
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean # the last two are used in TensorBoard


if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    env = gym.wrappers.Monitor(env, directory="mon", force=True) #video
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter(comment="-cartpole")

    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v) # fit Net over action distribution
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        if reward_m > 199:
            print("Solved!")
            break
    writer.close()

0: loss=0.694, reward_mean=20.2, reward_bound=22.0
1: loss=0.694, reward_mean=28.0, reward_bound=32.0
2: loss=0.687, reward_mean=24.2, reward_bound=29.0
3: loss=0.670, reward_mean=23.6, reward_bound=23.0
4: loss=0.661, reward_mean=27.9, reward_bound=32.5
5: loss=0.661, reward_mean=37.3, reward_bound=43.0
6: loss=0.658, reward_mean=43.8, reward_bound=55.0
7: loss=0.650, reward_mean=45.3, reward_bound=58.0
8: loss=0.629, reward_mean=30.8, reward_bound=33.5
9: loss=0.638, reward_mean=35.1, reward_bound=39.5
10: loss=0.629, reward_mean=45.6, reward_bound=54.0
11: loss=0.607, reward_mean=49.8, reward_bound=52.5
12: loss=0.623, reward_mean=42.6, reward_bound=46.0
13: loss=0.608, reward_mean=52.8, reward_bound=60.5
14: loss=0.621, reward_mean=49.2, reward_bound=60.5
15: loss=0.595, reward_mean=47.5, reward_bound=54.0
16: loss=0.591, reward_mean=48.5, reward_bound=55.5
17: loss=0.593, reward_mean=52.3, reward_bound=57.5
18: loss=0.580, reward_mean=62.2, reward_bound=67.5
19: loss=0.577, reward

### frozenlake naive

In [49]:
import gym
e=gym.make("FrozenLake-v0")
e.observation_space,e.action_space,e.reset(),e.render()


SFFF
FHFH
FFFH
HFFG


(Discrete(16), Discrete(4), 0, None)

In [4]:
import gym, gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70


class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean


if __name__ == "__main__":
    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
    env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter(comment="-frozenlake-naive")

    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        if reward_m > 0.8:
            print("Solved!")
            break
    writer.close()

0: loss=1.399, reward_mean=0.0, reward_bound=0.0
1: loss=1.373, reward_mean=0.0, reward_bound=0.0
2: loss=1.350, reward_mean=0.2, reward_bound=0.0
3: loss=1.337, reward_mean=0.0, reward_bound=0.0
4: loss=1.332, reward_mean=0.0, reward_bound=0.0
5: loss=1.333, reward_mean=0.0, reward_bound=0.0
6: loss=1.286, reward_mean=0.0, reward_bound=0.0
7: loss=1.290, reward_mean=0.0, reward_bound=0.0
8: loss=1.295, reward_mean=0.0, reward_bound=0.0
9: loss=1.261, reward_mean=0.0, reward_bound=0.0
10: loss=1.321, reward_mean=0.0, reward_bound=0.0
11: loss=1.245, reward_mean=0.0, reward_bound=0.0
12: loss=1.323, reward_mean=0.0, reward_bound=0.0
13: loss=1.265, reward_mean=0.0, reward_bound=0.0
14: loss=1.282, reward_mean=0.0, reward_bound=0.0
15: loss=1.229, reward_mean=0.0, reward_bound=0.0
16: loss=1.129, reward_mean=0.0, reward_bound=0.0
17: loss=1.256, reward_mean=0.0, reward_bound=0.0
18: loss=1.163, reward_mean=0.0, reward_bound=0.0
19: loss=1.181, reward_mean=0.0, reward_bound=0.0
20: loss=1

KeyboardInterrupt: 

In CartPole, every step of the environment gives us a
reward of 1.0 until the moment the pole falls. So, the longer our agent
balances the pole, the more reward it obtains. Due to randomness in our agent's
behavior, different episodes have different lengths, which gives us a roughly
normal distribution of episode rewards. After choosing a reward boundary,
we reject the less successful episodes and learn to repeat the better ones (by
training on the successful episodes' data).

In the FrozenLake environment, episodes and their rewards look different. We get
a reward of 1.0 only when we reach the goal, and this reward says nothing
about how good the episode was. Was it quick and efficient, or did we make
four rounds on the lake before we randomly stepped into the final cell? We don't
know; it's just a reward of 1.0 and that's it. The distribution of rewards for our
episodes is also problematic. Only two kinds of episodes are possible,
those with zero reward (failed) and those with a reward of one (successful), and failed
episodes dominate at the beginning of training. With mostly zero rewards, the
reward boundary is itself zero, so the percentile selection of "elite" episodes
keeps the failed episodes too and gives us bad examples to train on. This is
the reason for our training failure.
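
A small sketch makes the failure concrete. It assumes a batch of 16 episodes with a single success, and uses a hypothetical `percentile` helper (linear interpolation, written here so the example needs no third-party packages) in place of the percentile call in `filter_batch`.

```python
def percentile(values, q):
    """Linear-interpolation percentile (hypothetical helper, q in [0, 100])."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Assumed batch of 16 FrozenLake episodes: 15 failures and one success.
rewards = [0.0] * 15 + [1.0]
bound = percentile(rewards, 70)

# Same rule as filter_batch above: an episode is dropped only if reward < bound.
kept = [r for r in rewards if not r < bound]
print("reward_bound=%.1f, episodes kept=%d of %d" % (bound, len(kept), len(rewards)))
# prints: reward_bound=0.0, episodes kept=16 of 16
```

Because the boundary is 0.0, no episode is filtered out, so the network is trained on failed episodes as if they were elite.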

### frozenlake tweaked

- **Larger batches of played episodes**: FrozenLake requires at least 100 episodes per batch just to get some successful ones.
- **Discount factor applied to reward**: This makes the total reward for an episode depend on its length and adds variety among episodes. The reward for shorter episodes will be higher than the reward for longer ones.
- **Keeping "elite" episodes for a longer time**: In FrozenLake, a successful episode is a much rarer animal, so we need to keep elite episodes for several iterations to train on them.
- **Decreased learning rate**: This gives our network time to average over more training samples.
- **Much longer training time**: Due to the sparsity of successful episodes and the random outcome of our actions, it's much harder for our network to get an idea of the best behavior to perform in any particular situation. To reach 50% successful episodes, about 5k training iterations are required.
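
The effect of the discount factor can be sketched in a few lines. The episode lengths below are made-up values for illustration, and `discounted_score` is a hypothetical helper that mirrors the `reward * (GAMMA ** len(steps))` expression used in the script for this section.

```python
GAMMA = 0.9  # discount factor, the value used in the script for this section


def discounted_score(reward, n_steps, gamma=GAMMA):
    """Episode score used for elite selection: reward * gamma ** length."""
    return reward * gamma ** n_steps


# Hypothetical episodes given as (total reward, number of steps).
episodes = {
    "short success": (1.0, 6),
    "long success": (1.0, 40),
    "failure": (0.0, 10),
}
scores = {name: discounted_score(r, n) for name, (r, n) in episodes.items()}
for name, score in scores.items():
    print("%-13s -> %.4f" % (name, score))
```

Successful episodes now receive distinct scores depending on their length, so a percentile boundary can separate short successes from long ones, while failed episodes still score zero.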

In [6]:
import random
import gym
import gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128
BATCH_SIZE = 100 #Larger batches of played episodes
PERCENTILE = 30
GAMMA = 0.9 #Discount factor applied to reward


class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch)) #discount factor
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound # elite_batch


if __name__ == "__main__":
    random.seed(12345)
    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.001) #the learning rate decreased 10 times
    writer = SummaryWriter(comment="-frozenlake-tweaked")

    full_batch = []
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
        # use previous "elite" episodes 
        full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
        if not full_batch:
            continue
        obs_v = torch.FloatTensor(obs)
        acts_v = torch.LongTensor(acts)
        full_batch = full_batch[-500:]

        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        if iter_no % 100 == 0:
            print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
                iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
        if reward_mean > 0.8:
            print("Solved!")
            break
    writer.close()

0: loss=1.371, reward_mean=0.010, reward_bound=0.000, batch=1
100: loss=1.280, reward_mean=0.040, reward_bound=0.000, batch=181
200: loss=1.165, reward_mean=0.070, reward_bound=0.254, batch=225
300: loss=1.075, reward_mean=0.040, reward_bound=0.277, batch=229
400: loss=1.071, reward_mean=0.060, reward_bound=0.000, batch=139
500: loss=1.024, reward_mean=0.090, reward_bound=0.183, batch=227
600: loss=1.011, reward_mean=0.060, reward_bound=0.045, batch=221
700: loss=1.033, reward_mean=0.140, reward_bound=0.314, batch=220
800: loss=1.021, reward_mean=0.030, reward_bound=0.229, batch=229
900: loss=0.783, reward_mean=0.170, reward_bound=0.185, batch=217
1000: loss=0.753, reward_mean=0.130, reward_bound=0.349, batch=226
1100: loss=0.763, reward_mean=0.110, reward_bound=0.080, batch=219
1200: loss=0.728, reward_mean=0.200, reward_bound=0.351, batch=228
1300: loss=0.719, reward_mean=0.160, reward_bound=0.254, batch=226
1400: loss=0.722, reward_mean=0.110, reward_bound=0.405, batch=230
1500: los

KeyboardInterrupt: 

### frozenlake nonslippery

In [28]:
import random
import gym
import gym.spaces
import gym.wrappers
import gym.envs.toy_text.frozen_lake
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128
BATCH_SIZE = 100
PERCENTILE = 30
GAMMA = 0.9


class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound


if __name__ == "__main__":
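    # Note: this seeds Python's `random` module only; actions in iterate_batches
    # are drawn with np.random.choice, which this call does not seed.
    random.seed(12345)
    env = gym.envs.toy_text.frozen_lake.FrozenLakeEnv(is_slippery=False)
    # TimeLimit is left disabled: env.spec is None here, so the wrapper raised an error.
    # env = gym.wrappers.TimeLimit(env, max_episode_steps=100)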
    env = DiscreteOneHotWrapper(env)
    env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.001)
    writer = SummaryWriter(comment="-frozenlake-nonslippery")

    full_batch = []
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
        full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
        if not full_batch:
            continue
        obs_v = torch.FloatTensor(obs)
        acts_v = torch.LongTensor(acts)
        full_batch = full_batch[-500:]

        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        if iter_no % 10 == 0:
            print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
                iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
        if reward_mean > 0.8:
            print("Solved!")
            break
    writer.close()



10: loss=1.358, reward_mean=0.040, reward_bound=0.000, batch=16
20: loss=1.308, reward_mean=0.020, reward_bound=0.000, batch=34
30: loss=1.265, reward_mean=0.070, reward_bound=0.000, batch=71
40: loss=1.228, reward_mean=0.060, reward_bound=0.000, batch=140
50: loss=1.171, reward_mean=0.120, reward_bound=0.122, batch=220
60: loss=1.013, reward_mean=0.190, reward_bound=0.314, batch=205
70: loss=0.849, reward_mean=0.240, reward_bound=0.387, batch=213
80: loss=0.690, reward_mean=0.330, reward_bound=0.328, batch=181
90: loss=0.696, reward_mean=0.440, reward_bound=0.000, batch=86
100: loss=0.557, reward_mean=0.560, reward_bound=0.387, batch=135
110: loss=0.404, reward_mean=0.600, reward_bound=0.430, batch=99
Solved!
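
The `reward_bound` values in the log are easier to read once the length discount from `filter_batch` is spelled out. Below is a minimal sketch, assuming a successful episode ends with a total reward of 1.0 and a failed one with 0.0; the helper name `discounted_reward` is introduced here only for illustration and is not part of the cell above.

```python
GAMMA = 0.9  # same discount factor as in the cell above


def discounted_reward(total_reward, n_steps, gamma=GAMMA):
    """Per-episode score used for filtering: reward * gamma ** episode_length."""
    return total_reward * (gamma ** n_steps)


if __name__ == "__main__":
    # Shorter successful episodes keep more of their reward; failed ones stay at 0.
    for n_steps in (6, 8, 9, 11):
        print("%2d steps: %.3f" % (n_steps, discounted_reward(1.0, n_steps)))
    print("failed episode: %.3f" % discounted_reward(0.0, 30))
```

Under that assumption, a bound of 0.387 is consistent with a nine-step successful episode (0.9^9 ≈ 0.387) and 0.430 with an eight-step one. Bounds that don't match a power of 0.9 can appear because `np.percentile` interpolates between neighbouring values.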


### Theoretical background of the cross-entropy method

The basis of the cross-entropy method lies in the importance sampling theorem, which states this:
$$
\mathbb{E}_{x\sim p(x)}[H(x)]=\mathbb{E}_{x\sim q(x)}\left[\frac{p(x)}{q(x)}H(x)  \right]
$$
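
The identity can be checked numerically. The sketch below is only an illustration under assumed choices: the target $p$ is a standard normal, the proposal $q$ is a wider normal centred at 1, and $H(x)=x^2$, so the exact expectation under $p$ is 1. The function names are made up for this example.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def importance_sampling_estimate(h, n_samples=200000, seed=0):
    """Estimate E_{x~p}[h(x)] from samples drawn from q instead of p.

    p = N(0, 1) and q = N(1, 2^2) are arbitrary choices for this sketch.
    """
    rng = np.random.RandomState(seed)
    x = rng.normal(loc=1.0, scale=2.0, size=n_samples)           # x ~ q(x)
    weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)  # p(x) / q(x)
    return float(np.mean(weights * h(x)))


if __name__ == "__main__":
    # For p = N(0, 1) the exact value of E[x^2] is 1.
    estimate = importance_sampling_estimate(lambda x: x ** 2)
    print("importance-sampling estimate of E_p[x^2]: %.3f" % estimate)
```

The estimate should land close to 1 even though no sample was drawn from $p$ itself; only the ratio $p(x)/q(x)$ was needed.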
In our RL case, $H(x)$ is the reward value obtained by some policy $x$, and $p(x)$ is the distribution over all possible policies. We don't want to maximize our reward by searching through all possible policies. Instead, we want to find a way to approximate $p(x)H(x)$ by $q(x)$, iteratively minimizing the distance between them. The distance between two probability distributions is measured by the KL-divergence, which is defined as follows:

$$
KL(p_1(x)\,\|\,p_2(x))=\mathbb{E}_{x\sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right] = \mathbb{E}_{x\sim p_1(x)}[\log p_1(x)]-\mathbb{E}_{x\sim p_1(x)}[\log p_2(x)]
$$

The first term in KL is the (negative) **entropy** of $p_1(x)$. It doesn't depend on
$p_2(x)$, so it can be omitted during the minimization. The second term, taken with
its minus sign, is called **cross-entropy** and is a very common optimization objective in DL.
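
The decomposition is easy to verify for discrete distributions. A small sketch follows, with two arbitrary example distributions over three outcomes; the function names are chosen for this illustration.

```python
import numpy as np


def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return float(-np.sum(p * np.log(p)))


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return float(-np.sum(p * np.log(q)))


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return float(np.sum(p * np.log(p / q)))


if __name__ == "__main__":
    # Arbitrary distributions; every probability is > 0 so the logs are finite.
    p1 = np.array([0.7, 0.2, 0.1])
    p2 = np.array([0.5, 0.3, 0.2])
    print("KL(p1 || p2)            = %.4f" % kl_divergence(p1, p2))
    print("cross-entropy - entropy = %.4f" % (cross_entropy(p1, p2) - entropy(p1)))
```

Both lines print the same number. Since `entropy(p1)` stays fixed while `p2` changes, minimizing the KL-divergence over `p2` is the same as minimizing the cross-entropy.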

Combining both formulas, we get an iterative algorithm that starts with $q_0(x)=p(x)$ and improves on every step, approximating $p(x)H(x)$ with the update:

$$
q_{i+1}(x)=\arg\min_{q_{i+1}(x)}-\mathbb{E}_{x\sim q_i (x)}\frac{p(x)}{q_i(x)}H(x)\log q_{i+1}(x)
$$

This is the generic cross-entropy method, which can be significantly simplified in
our RL case. First, we replace $H(x)$ with an indicator function, which is 1
when the reward for the episode is above the threshold and 0 when it is
below. Our policy update then looks like this:

$$
\pi_{i+1}(a|s)=\arg\min_{\pi_{i+1}(a|s)}-\mathbb{E}_{z\sim \pi_{i}(a|s)}[R(z)\ge \psi_i]\log \pi_{i+1}(a|s)
$$

Strictly speaking, the preceding formula misses the normalization term, but it
still works in practice without it. So, the method is quite clear: we sample
episodes using our current policy (starting with some random initial policy) and
minimize the negative log likelihood of the most successful samples under our
policy.
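
To make the loop concrete without a neural network, here is a toy sketch under loudly stated assumptions: the "policy" is a single categorical distribution over two actions (no states), an episode is five sampled actions, and the reward is how often action 1 was picked. All names and the toy reward are made up for this illustration. For such a tabular policy, the negative log likelihood of the elite actions is minimized by their empirical frequencies, so no gradient step is needed.

```python
import numpy as np


def sample_episode(policy, n_steps, rng):
    """Sample a sequence of actions from a stateless categorical policy."""
    return rng.choice(len(policy), size=n_steps, p=policy)


def episode_reward(actions):
    """Toy reward (an assumption of this sketch): how many times action 1 was chosen."""
    return float(np.sum(actions == 1))


def cross_entropy_step(policy, rng, n_episodes=200, n_steps=5, percentile=70):
    """One iteration: sample with the current policy, keep elites, refit the policy."""
    episodes = [sample_episode(policy, n_steps, rng) for _ in range(n_episodes)]
    rewards = np.array([episode_reward(ep) for ep in episodes])
    bound = np.percentile(rewards, percentile)                      # threshold psi_i
    elite = [ep for ep, r in zip(episodes, rewards) if r >= bound]  # indicator [R(z) >= psi_i]
    # Closed-form minimizer of the negative log likelihood for a tabular policy:
    # empirical action frequencies (add-one smoothing keeps probabilities > 0).
    counts = np.bincount(np.concatenate(elite), minlength=len(policy)) + 1.0
    return counts / counts.sum(), bound


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    policy = np.array([0.5, 0.5])  # random initial policy
    for i in range(10):
        policy, bound = cross_entropy_step(policy, rng)
        print("iter %d: bound=%.1f, P(action=1)=%.3f" % (i, bound, policy[1]))
```

The probability of action 1 should climb towards 1 over the iterations. The notebook cells above follow the same pattern, with the frequency count replaced by a gradient step on `nn.CrossEntropyLoss` and a strict `>` comparison against the bound.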

There is a whole book dedicated to this method, written by Dirk P. Kroese. A
shorter description can be found in the Cross-Entropy Method paper by Dirk
P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).

## Tabular Learning and the Bellman Equation

## Deep Q-Networks

## DQN Extensions

## Stocks Trading Using RL

## Policy Gradients -- An Alternative

## The Actor-Critic Method

## Asynchronous Advantage Actor-Critic

## Chatbots Training with RL

## Web Navigation

## Continuous Action Space

## Trust Regions -- TRPO, PPO and ACKTR

## Black-Box Optimization in RL

## Beyond Model-Free -- Imagination

## AlphaGo Zero