## Jupyter Notebook for Cross-Entropy Reinforcement Learning

The Cross-Entropy Method is a *model-free*, *policy-based*, and *on-policy* reinforcement learning method.

- It doesn't build any model of the environment.
- It approximates the policy of the agent.
- It trains on fresh data obtained from the environment.

### Algorithm Steps:

1. Play N episodes using the current model.
2. Calculate the total reward for every episode and decide on a reward cut-off boundary (here, a percentile of the episode rewards).
3. Throw away all episodes with a total reward below the boundary.
4. Train on the remaining episodes, using the observations as the input and the issued actions as the desired output.
5. Repeat from step 1 until the total reward per episode converges or the desired performance is reached.
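Steps 2 and 3 can be sketched with NumPy alone. This is a minimal illustration, not the code inside `Cross_Entropy_Agent`: the `filter_elite` helper, the `(total_reward, observations, actions)` tuple layout, and the toy episodes are all assumptions made for the example.

```python
import numpy as np


def filter_elite(episodes, percentile):
    """Keep only episodes whose total reward reaches the percentile boundary.

    `episodes` is a list of (total_reward, observations, actions) tuples.
    This layout is hypothetical and chosen only for this sketch.
    """
    rewards = [total_reward for total_reward, _, _ in episodes]
    reward_bound = np.percentile(rewards, percentile)  # step 2: cut-off boundary
    reward_mean = float(np.mean(rewards))

    train_obs, train_act = [], []
    for total_reward, observations, actions in episodes:
        if total_reward < reward_bound:  # step 3: discard weak episodes
            continue
        train_obs.extend(observations)
        train_act.extend(actions)
    return train_obs, train_act, reward_bound, reward_mean


# Four toy episodes with made-up observations and actions.
episodes = [
    (10.0, [[0.0], [0.1]], [0, 1]),
    (20.0, [[0.2], [0.3]], [1, 0]),
    (30.0, [[0.4], [0.5]], [0, 0]),
    (40.0, [[0.6], [0.7], [0.8]], [1, 1, 0]),
]
train_obs, train_act, reward_bound, reward_mean = filter_elite(episodes, 70)
print(f"reward_mean={reward_mean:.1f}, reward_bound={reward_bound:.1f}")
```

With a 70th-percentile cut-off, only the best episode survives in this toy batch, so the training set for step 4 consists of its observations and actions.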

In [1]:
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from Agents.cross_entropy_agent import Cross_Entropy_Agent

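HIDDEN_SIZE = 128        # number of units in the hidden layer
BATCH_SIZE = 16          # number of episodes played per training iteration
CUT_OFF_PERCENTILE = 70  # reward percentile below which episodes are discarded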



### Create the Model Network

In [2]:
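class Net(nn.Module):
    """One-hidden-layer network mapping an observation to one raw score per action."""

    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            # No softmax here: the output is a raw score (logit) for each action.
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)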
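Because the network ends in a plain linear layer, its output is a vector of raw scores, not probabilities. The sketch below shows one common way to turn those scores into a sampled action: apply a softmax, then draw from the resulting distribution. It is an assumption that `Cross_Entropy_Agent` selects actions this way; the all-zero observation and the sizes (4 inputs, 2 actions, matching CartPole) are placeholders.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Same architecture as the notebook's network, repeated so the sketch runs alone."""

    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
net = Net(obs_size=4, hidden_size=128, n_actions=2)

# Placeholder observation with a batch dimension of 1.
obs = torch.zeros(1, 4)

with torch.no_grad():
    logits = net(obs)                     # raw scores, shape (1, 2)
    probs = torch.softmax(logits, dim=1)  # action probabilities, each row sums to 1

# Sample an action index according to the probabilities.
action = int(torch.multinomial(probs, num_samples=1).item())
print(probs, action)
```

Sampling from the distribution, instead of always taking the most likely action, gives the agent some exploration while it is still learning.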

### Agent-Environment Setup

In [3]:
environment = gym.make("CartPole-v0")

obs_size = environment.observation_space.shape[0]
n_actions = environment.action_space.n

network = Net(obs_size, HIDDEN_SIZE, n_actions)

agent = Cross_Entropy_Agent(environment=environment, network=network)
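The internals of `Cross_Entropy_Agent` are not shown in this notebook. As a rough sketch of step 4 of the algorithm, a single training step might look like the following, assuming the agent uses `nn.CrossEntropyLoss` with an Adam optimizer. The `train_step` helper, the learning rate, and the random "elite" batch are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in for the notebook's network: 4 observation values in, 2 action scores out.
net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
objective = nn.CrossEntropyLoss()  # expects raw scores and integer class labels
optimizer = optim.Adam(net.parameters(), lr=0.01)  # lr is an assumed value

# Random placeholder data standing in for observations and actions of elite episodes.
elite_obs = torch.randn(32, 4)
elite_actions = torch.randint(0, 2, (32,))


def train_step(obs_batch, action_batch):
    """Run one supervised update: observations as input, taken actions as targets."""
    optimizer.zero_grad()
    action_scores = net(obs_batch)
    loss = objective(action_scores, action_batch)
    loss.backward()
    optimizer.step()
    return loss.item()


first_loss = train_step(elite_obs, elite_actions)
for _ in range(50):
    last_loss = train_step(elite_obs, elite_actions)
print(f"first loss={first_loss:.3f}, last loss={last_loss:.3f}")
```

For two actions, an untrained policy that is close to uniform has a cross-entropy loss near ln 2 ≈ 0.693, which is a useful sanity check for the first logged loss values.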

In [4]:
agent.run(BATCH_SIZE, CUT_OFF_PERCENTILE)

0: loss=0.685, reward_mean=20.8, reward_bound=20.0
1: loss=0.670, reward_mean=22.3, reward_bound=23.0
2: loss=0.673, reward_mean=19.1, reward_bound=22.5
3: loss=0.688, reward_mean=21.1, reward_bound=27.5
4: loss=0.684, reward_mean=25.2, reward_bound=27.0
5: loss=0.669, reward_mean=30.9, reward_bound=38.5
6: loss=0.681, reward_mean=37.4, reward_bound=44.5
7: loss=0.665, reward_mean=26.7, reward_bound=30.0
8: loss=0.652, reward_mean=40.0, reward_bound=45.0
9: loss=0.625, reward_mean=32.8, reward_bound=35.0
10: loss=0.641, reward_mean=44.0, reward_bound=46.0
11: loss=0.620, reward_mean=31.6, reward_bound=35.5
12: loss=0.634, reward_mean=41.5, reward_bound=47.0
13: loss=0.597, reward_mean=42.9, reward_bound=46.0
14: loss=0.609, reward_mean=54.1, reward_bound=66.0
15: loss=0.609, reward_mean=45.3, reward_bound=57.5
16: loss=0.615, reward_mean=53.6, reward_bound=58.0
17: loss=0.596, reward_mean=55.8, reward_bound=76.0
18: loss=0.594, reward_mean=60.4, reward_bound=64.0
19: loss=0.595, reward