<a href="https://colab.research.google.com/github/asrjy/ldrl/blob/main/Chapter%205%20-%20The%20Cross-Entropy%20Method.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Cross-Entropy Method

The cross-entropy method is simple and converges well. It works well in simple environments that have short episodes with frequent rewards and don't require complex, multistep policies.

### The taxonomy of RL Methods

The cross-entropy method falls into the model-free and policy-based category of methods. 

There are many ways to categorize RL methods, but the most common distinctions are:

- Model-free or Model-based
- Value-based or Policy-based
- On-policy or Off-policy

Model-free means we don't build a model of the environment or reward. The method takes the current observations, does some computation on them, and the result is the action it should take. Model-free methods are easier to train, because it's hard to build good models of complex environments with rich observations.

A model-based method tries to predict what the next observation and/or the next reward will be, and chooses the best possible action based on this prediction. It often looks several steps into the future. Model-based methods are often seen in deterministic environments, such as board games with strict rules. Only recently have people started combining both approaches to get the best of both worlds.

Policy-based methods directly approximate the policy of the agent, i.e., what action the agent should carry out at every step. The policy is represented by a probability distribution over the available actions.

In value-based methods, the agent calculates the value of every possible action and chooses the action with the best value. Both families of methods are equally popular.
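
The difference between the two families shows up in how an action is picked. The sketch below is only an illustration and is not part of the notebook: the action probabilities and action values are made-up numbers, and `pick_policy_based` and `pick_value_based` are hypothetical helper names.

```python
import random

def pick_policy_based(action_probs, rng):
    """Sample an action index from a probability distribution over actions."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

def pick_value_based(action_values):
    """Pick the action index with the highest estimated value (greedy choice)."""
    return max(range(len(action_values)), key=lambda a: action_values[a])

rng = random.Random(0)        # fixed seed so the run is repeatable
probs = [0.1, 0.7, 0.2]       # assumed policy output for one observation
values = [1.5, 0.3, 2.0]      # assumed value estimates for the same three actions

sampled = [pick_policy_based(probs, rng) for _ in range(1000)]
greedy = pick_value_based(values)
print("share of action 1 among samples:", sampled.count(1) / len(sampled))
print("greedy action:", greedy)
```

The policy-based choice is random but follows the given distribution, while the value-based choice is the same every time for the same values.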

Off-policy means the method can learn from historical data. On-policy methods require fresh data obtained from the environment with the current policy.

The cross-entropy method is a model-free, policy-based, on-policy method of reinforcement learning.




### Cross-Entropy method in practice

The agent is the trickiest part of reinforcement learning: it tries to accumulate as much reward as possible. In practice, we replace the agent's internals with a common ML approach: a nonlinear trainable function that maps the agent's input (observations) to some output.

In the cross-entropy method, a nonlinear function (a neural network) produces the policy, which tells the agent which action to take for each observation.

In practice, the policy is a probability distribution over all actions, which makes the problem similar to a classification problem with the actions as classes.

So, in a sense, the observation passes from the environment to the neural network, which outputs a probability distribution over actions. The agent then samples from this distribution to get the action to carry out. This adds randomness to the agent, which is a good thing: when the model is initialized it has random weights, so the agent starts out behaving randomly.

An agent's lifetime is represented as a series of episodes. Each episode is a sequence of the observations the agent received from the environment, the actions it took, and the rewards for these actions.

A discount factor tells the method how much importance is given to future rewards. With a discount factor of 1, the total reward of an episode is just the sum of all its local rewards.
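
As a small worked example, the total reward of an episode can be computed by weighting the reward at step t by the discount factor raised to the power t. The helper name `discounted_return` is hypothetical. The notebook itself never discounts, which corresponds to a discount factor of 1.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards of one episode, weighting step t by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

rewards = [1.0, 1.0, 1.0, 1.0]            # CartPole-style reward of 1 per step
print(discounted_return(rewards, 1.0))    # 4.0: plain sum of the rewards
print(discounted_return(rewards, 0.5))    # 1.875: later rewards count for less
```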

The core of the cross-entropy method is to throw away bad episodes and train on better ones. The steps are as follows:

- Play N episodes using the current model and environment.
- Calculate the total reward for every episode and decide on a reward threshold. It is usually set at the 50th or 70th percentile of all rewards.
- Throw away all episodes with a reward below the threshold.
- Train on the remaining episodes, using observations as the input and the issued actions as the desired output.
- Repeat from the first step until a satisfactory result is obtained.
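
The filtering step can be sketched in plain Python. This is an illustration under assumptions, not the notebook's implementation: the episode rewards are made up, and `percentile` and `select_elite` are hypothetical helpers. The percentile uses linear interpolation between sorted values, which is also the default rule of `np.percentile`.

```python
import math

def percentile(values, q):
    """q-th percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def select_elite(episodes, q):
    """Keep the episodes whose total reward is at or above the q-th percentile."""
    bound = percentile([reward for reward, _ in episodes], q)
    elite = [(reward, steps) for reward, steps in episodes if reward >= bound]
    return elite, bound

# Each episode: (total reward, list of (observation, action) pairs). Values are made up.
episodes = [
    (10.0, [("obs_a", 0)]),
    (20.0, [("obs_b", 1)]),
    (30.0, [("obs_c", 0)]),
    (40.0, [("obs_d", 1)]),
    (50.0, [("obs_e", 1)]),
]
elite, bound = select_elite(episodes, 70)
print("reward bound:", bound)
print("elite rewards:", [reward for reward, _ in elite])
```

With the 70th percentile, only the two best of these five episodes survive, and their observation/action pairs would become the training set for the next update.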

The cross-entropy method is quite robust to changes in hyperparameters, which makes it an ideal baseline method to try.

### Cross-Entropy on CartPole

The network has a single hidden layer of 128 neurons with ReLU activation.

In [1]:
!pip install tensorboardX imageio-ffmpeg ffmpeg

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from collections import namedtuple
import gym
from tensorboardX import SummaryWriter
import numpy as np 

In [2]:
HIDDEN_SIZE = 128
BATCH_SIZE = 16 # number of episodes on each iteration
PERCENTILE = 70

In [3]:
class Net(torch.nn.Module):
  def __init__(self, obs_size, hidden_size, n_actions):
    super(Net, self).__init__()
    self.net = nn.Sequential(
        nn.Linear(obs_size, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, n_actions)
    )
  def forward(self, x):
    return self.net(x)

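We use nn.CrossEntropyLoss, which combines softmax and the cross-entropy loss in one step, instead of applying softmax in the network and then calculating the cross-entropy loss. nn.CrossEntropyLoss expects raw, unnormalized scores from the network, so the network has no softmax layer, and we have to apply softmax to its outputs ourselves whenever we need the action probabilities.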
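
To make the relation concrete, here is a small pure-Python sketch of what happens to the raw scores for a single sample. It is not the PyTorch implementation, and `softmax` and `cross_entropy_from_logits` are hypothetical helper names. The loss is the negative log of the softmax probability assigned to the target action, so a network that scores CartPole's two actions equally has a loss of ln 2, about 0.693.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, computed from raw scores."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5]                               # assumed raw network output for two actions
probs = softmax(logits)
print("probabilities:", probs)
print("loss if action 0 was taken:", cross_entropy_from_logits(logits, 0))
print("loss for equal scores:", cross_entropy_from_logits([0.0, 0.0], 1))
```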

In [4]:
Episode = namedtuple('Episode', field_names = ['reward', 'steps'])
EpisodeStep = namedtuple(
    'EpisodeStep', field_names = ['observation', 'action']
)

EpisodeStep represents a single step the agent made in an episode. It stores the observation from the environment and the action the agent took.
Episode is a single episode, stored as the total undiscounted reward (gamma = 1) and a collection of EpisodeStep instances.
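
A short usage sketch with made-up values shows how the two tuples nest, and why an episode can be unpacked directly into its reward and steps. The observations here are placeholder lists, not real CartPole states.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Two steps of a made-up episode: each stores the observation seen and the action chosen.
steps = [
    EpisodeStep(observation=[0.01, 0.02, -0.03, 0.04], action=1),
    EpisodeStep(observation=[0.02, 0.21, -0.03, -0.25], action=0),
]
episode = Episode(reward=2.0, steps=steps)

# Fields are available by name...
print(episode.reward, episode.steps[0].action)

# ...and, because a namedtuple is still a tuple, by positional unpacking.
reward, episode_steps = episode
actions = [step.action for step in episode_steps]
print(reward, actions)
```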

In [5]:
def iterate_batches(env, net, batch_size):
  """
  This takes the gym environment, neural net and the batch size. 
  The Episode instances are stored in a list. 
  The reward is tracked. 
  The environment is reset to obtain the first observation and softmax layer is created to get the prob dists. 
  """
  batch = []
  episode_reward = 0.0
  episode_steps = []
  obs = env.reset()
  sm = nn.Softmax(dim = 1)
  while True:
    obs_v = torch.FloatTensor([obs]) # nn.Module instances expect a float tensor with a batch dimension, so wrap the observation in a list and convert it.
    act_probs_v = sm(net(obs_v)) # Get the action probabilities using softmax.
    act_probs = act_probs_v.data.numpy()[0] # Unpack the tensor into a NumPy array and take the only row of the batch.
    action = np.random.choice(len(act_probs), p = act_probs)
    next_obs, reward, is_done, _ = env.step(action)
    episode_reward += reward
    step = EpisodeStep(observation = obs, action = action)
    episode_steps.append(step)
    if is_done:
      e = Episode(reward = episode_reward, steps = episode_steps)
      batch.append(e)
      episode_reward = 0.0
      episode_steps = []
      next_obs = env.reset()
      if len(batch) == batch_size:
        yield batch
        batch = []
    obs = next_obs
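
iterate_batches is an infinite generator: it keeps playing episodes and hands back a batch every `batch_size` episodes, so the training loop decides when to stop. The sketch below shows the same control flow without gym or PyTorch. `play_episode` is a hypothetical stand-in that returns a random total reward, and `iterate_reward_batches` is likewise a name made up for this illustration.

```python
import itertools
import random

def play_episode(rng):
    """Hypothetical stand-in for one environment rollout; returns a total reward."""
    return float(rng.randint(10, 50))

def iterate_reward_batches(rng, batch_size):
    """Yield lists of `batch_size` episode rewards forever."""
    batch = []
    while True:
        batch.append(play_episode(rng))
        if len(batch) == batch_size:
            yield batch
            batch = []  # start collecting the next batch once the caller resumes us

rng = random.Random(0)
# The generator never ends on its own, so the consumer limits the iteration count.
batches = list(itertools.islice(iterate_reward_batches(rng, batch_size=4), 3))
for iter_no, rewards in enumerate(batches):
    print(iter_no, rewards)
```

Because execution pauses at `yield`, the network can be updated between batches and the next batch is then played with the updated weights, which is what keeps the method on-policy.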

In [6]:
def filter_batch(batch, percentile):
  """
  This function takes a batch of episodes, uses the percentile value to take the best episodes.
  """
  rewards = list(map(lambda s: s.reward, batch))
  reward_bound = np.percentile(rewards, percentile)
  reward_mean = float(np.mean(rewards))
  train_obs = []
  train_act = []
  for reward, steps in batch:
    if reward < reward_bound:
      continue
    train_obs.extend(map(lambda step: step.observation, steps))
    train_act.extend(map(lambda step: step.action, steps))
  # Build the tensors and return only after every episode in the batch has been checked.
  train_obs_v = torch.FloatTensor(train_obs)
  train_act_v = torch.LongTensor(train_act)
  return train_obs_v, train_act_v, reward_bound, reward_mean

In [7]:
if __name__ == "__main__":
  env = gym.make("CartPole-v0")
  # env = gym.wrappers.Monitor(env, directory="mon", force=True)
  obs_size = env.observation_space.shape[0]
  n_actions = env.action_space.n
  net = Net(obs_size, HIDDEN_SIZE, n_actions)
  objective = nn.CrossEntropyLoss()
  optimizer = optim.Adam(params = net.parameters(), lr = 0.01)
  writer = SummaryWriter(comment = "-cartpole")
  for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    print(f"{iter_no} loss = {loss_v.item():.3f} reward_mean = {reward_m:.1f} reward_bound = {reward_b:.1f}")
    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_bound", reward_b, iter_no)
    writer.add_scalar("reward_mean", reward_m, iter_no)
    if reward_m > 199:
      print("Solved!")
      break
  # Close the writer once, after the training loop has finished.
  writer.close()

0 loss = 0.705 reward_mean = 22.8 reward_bound = 24.5
1 loss = 0.671 reward_mean = 23.7 reward_bound = 23.5
2 loss = 0.657 reward_mean = 39.9 reward_bound = 30.5
3 loss = 0.660 reward_mean = 37.9 reward_bound = 34.5
4 loss = 0.651 reward_mean = 38.1 reward_bound = 44.0
5 loss = 0.625 reward_mean = 32.9 reward_bound = 35.0
6 loss = 0.636 reward_mean = 45.9 reward_bound = 64.0
7 loss = 0.629 reward_mean = 54.6 reward_bound = 52.0
8 loss = 0.670 reward_mean = 48.9 reward_bound = 44.5
9 loss = 0.602 reward_mean = 46.7 reward_bound = 53.0
10 loss = 0.602 reward_mean = 55.2 reward_bound = 59.0
11 loss = 0.618 reward_mean = 65.0 reward_bound = 77.0
12 loss = 0.594 reward_mean = 45.3 reward_bound = 46.0
13 loss = 0.598 reward_mean = 42.2 reward_bound = 45.0
14 loss = 0.585 reward_mean = 50.0 reward_bound = 48.0
15 loss = 0.559 reward_mean = 44.9 reward_bound = 50.0
16 loss = 0.595 reward_mean = 57.9 reward_bound = 59.0
17 loss = 0.589 reward_mean = 70.7 reward_bound = 80.5
18 loss = 0.554 rewa