### Balancing Cartpole using Cross Entropy Learning ###

In this project we will build our first smart RL agent, which will learn to solve the Cartpole problem.

Before we set everything up, let's get a theoretical sense of what we are going to do in this project. Cross Entropy based RL is a really cool and simple way of solving small problems. We let the agent play some number of episodes (an episode is a series of observations and actions that runs until a terminal state is reached) and record the total reward of each one. Then we filter out all the episodes whose reward is lower than a threshold, which leaves us with a nice batch of high-reward episodes. We give these episodes as training instances to a Neural Network: the observations are the inputs and the actions taken in the episodes are the targets. We attach a Cross Entropy Loss at the end, which turns the problem into ordinary supervised classification. We repeat this cycle over and over, and finally our NN learns how to keep the pole balanced on the cart.
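To make the loss concrete, here is a minimal pure-Python sketch of the cross entropy loss for a single step, assuming the network outputs one raw score (logit) per action. The helpers softmax and cross_entropy are illustrative names and not part of the project code, which relies on nn.CrossEntropyLoss instead.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, action):
    # Negative log-probability the policy assigns to the action actually taken
    return -math.log(softmax(logits)[action])

# An undecided two-action policy: both actions are equally likely
undecided = cross_entropy([0.0, 0.0], action=0)

# A policy that already prefers the action taken in the high-reward episode
confident = cross_entropy([2.0, 0.0], action=0)

print(round(undecided, 3), round(confident, 3))
```

The undecided case evaluates to ln 2, roughly 0.693, so a loss near that value means a two-action policy is still close to random. The loss shrinks as the network puts more probability on the actions seen in the high-reward episodes.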

In [1]:
## Importing the necessary libraries ##

import gym
import torch
import torch.nn as nn
from tensorboardX import SummaryWriter
import numpy as np
from collections import namedtuple

Now we have everything set up.

The first thing we do is create our neural network class.

It will be a very small fully connected neural network with two dense layers, and the output will have the same dimension as the number of actions available.

In [2]:
## Setting our NN ##

class FullyConnected(nn.Module):
    '''
    Creates the FC network which acts as the agent.
    '''
    
    def __init__(self , observation_size , hidden_size , action_size):
        
        super().__init__()
        
        self.network = nn.Sequential(nn.Linear(observation_size , hidden_size) ,  
                                     #nn.BatchNorm1d(hidden_size) , 
                                     nn.ReLU() ,
                                     nn.Linear(hidden_size , action_size))
        
    def forward(self , x):
        
        return self.network(x)

Boom!! We have set up our NN agent.

Let's check that it is working correctly!!

In [3]:
## Testing the FullyConnected Class ##

test_fc = FullyConnected(observation_size = 4 , hidden_size = 128 , action_size = 8)

test_inp = torch.randn((32 , 4))

test_out = test_fc(test_inp)
print('The Output shape is :' , tuple(test_out.shape))
print('This must match the value : (32 , 8)')

The Output shape is : (32, 8)
This must match the value : (32 , 8)


Perfect!! Our Neural Network model produces an output of the expected shape.

Now it's time to create a utility function which yields batches of episodes for the NN to train on.

We are going to store each episode in a namedtuple Episodes, which holds the total reward and the sequence of steps (each step is another namedtuple EachEpisodeStep with the fields observation and action).

In [4]:
## Creating the Episodes and the EachEpisodeStep named tuple ##

EachEpisodeStep = namedtuple('EachEpisodeStep' , field_names = ['observation' , 'action'])

Episodes = namedtuple('Episodes' , field_names = ['reward' , 'steps'])
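As a quick illustration of how these two containers fit together, the sketch below builds one episode by hand. The observation values, actions, and reward are made up for the example, and the namedtuple definitions are repeated so the snippet runs on its own.

```python
from collections import namedtuple

EachEpisodeStep = namedtuple('EachEpisodeStep', field_names=['observation', 'action'])
Episodes = namedtuple('Episodes', field_names=['reward', 'steps'])

# Two hand-written steps; real observations come from the environment
steps = [
    EachEpisodeStep(observation=[0.01, 0.02, -0.03, 0.04], action=1),
    EachEpisodeStep(observation=[0.02, 0.21, -0.03, -0.26], action=0),
]

episode = Episodes(reward=2.0, steps=steps)

# Fields are reachable by name, and the tuple still unpacks positionally
reward, episode_steps = episode
actions = [step.action for step in episode.steps]
print(reward, actions)
```

Positional unpacking is what allows a loop of the form `for reward , steps in batch` to work on a list of Episodes.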

With the initial setup done, let's create our utility function to yield a batch.

In [5]:
## Creating the Dataloader utility function ##

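def dataloader(env , net , batch_size):
    '''
    Plays episodes with the current network and yields
    them in batches of batch_size episodes.
    '''
    
    batch = []
    total_reward = 0.
    episode_step = []
    obs = env.reset()
    softmax = nn.Softmax(dim = 1)
    
    while True:
        # Turn the observation into a batch of one and get the action probabilities
        obs_tensor = torch.FloatTensor([obs])
        action_prob_tensor = softmax(net(obs_tensor))
        action_prob = action_prob_tensor.data.numpy()[0]
        
        # Sample an action according to the predicted probabilities
        action = np.random.choice(a = len(action_prob) , p = action_prob)
        
        next_obs , reward , is_done , _ = env.step(action)
        
        total_reward += reward
        
        # Store the observation the action was chosen from, not next_obs
        episode_step.append(EachEpisodeStep(observation = obs , action = action))
        
        if is_done:
            
            # The episode is over: save it and start a fresh one
            batch.append(Episodes(reward = total_reward , steps = episode_step))
            
            total_reward = 0.
            
            episode_step = []
            
            next_obs = env.reset()
        
        if len(batch) == batch_size:
            
            yield batch
            
            batch = []
            
        obs = next_obs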
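
The sampling step inside the loop can be tried in isolation. This is a minimal sketch that uses Python's standard random module in place of np.random.choice, with made-up probabilities; sample_action is a hypothetical helper and not part of the project code. Sampling from the probabilities, rather than always picking the most likely action, is what lets the agent try both actions while the network is still untrained.

```python
import random

def sample_action(action_prob, rng=random):
    # Draw one action index, weighted by its probability
    return rng.choices(range(len(action_prob)), weights=action_prob, k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable

# Made-up policy output for a two-action problem
action_prob = [0.8, 0.2]
counts = [0, 0]
for _ in range(1000):
    counts[sample_action(action_prob, rng)] += 1

# Action 0 should be picked roughly four times as often as action 1
print(counts)
```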

Now we have created our dataloader, but the problem is that we won't be training on the entire batch, only on the episodes whose reward is above a certain threshold. So now we are going to create another utility function which filters the dataloader batch and keeps only the episodes with a reward at or above the threshold. We use a percentile of the rewards in the batch to calculate the threshold value.
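
To see what the percentile threshold does, here is a small pure-Python sketch. The percentile helper is a hypothetical stand-in that is meant to mimic the default linear interpolation of np.percentile, and the rewards are made up.

```python
def percentile(values, q):
    # Linear interpolation between the two nearest sorted values,
    # intended to mirror np.percentile's default behaviour
    ordered = sorted(values)
    position = q / 100 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Made-up total rewards for a batch of five episodes
rewards = [10.0, 20.0, 30.0, 40.0, 50.0]

threshold = percentile(rewards, 70)  # about 38

# Keep only the episodes whose reward is at or above the threshold
elite = [r for r in rewards if r >= threshold]
print(threshold, elite)
```

With a 70th percentile, roughly the top 30% of the batch survives; here that is the two episodes with rewards 40 and 50.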

In [6]:
## Filtering utility ##

def filter_batch(batch , percentile):
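    '''
    Keeps only the episodes whose total reward is at or above
    the given percentile of the rewards in the batch.
    Returns the observations and actions of those episodes as
    tensors, along with the threshold and the mean batch reward.
    '''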
    
    rewards = list(map(lambda r : r.reward , batch))
    
    threshold = np.percentile(rewards , percentile)
    
    mean_reward = float(np.mean(rewards))
    
    train_observation = []
    
    train_action = []
    
    for reward , steps in batch:
        
        if reward < threshold:
            continue
            
        train_observation.extend(map(lambda step : step.observation , steps))
        train_action.extend(map(lambda step : step.action , steps))
        
    train_observation_tensor = torch.FloatTensor(train_observation)
    train_action_tensor = torch.LongTensor(train_action)
    
    return train_observation_tensor , train_action_tensor , threshold , mean_reward

Done!!

We are all set to create the final loop to balance our cartpole.

In [7]:
## Creating the agent and the env which will play the entire game ##

env = gym.make('CartPole-v0')

observation_size = env.observation_space.shape[0]

action_size = env.action_space.n

net = FullyConnected(observation_size = observation_size , hidden_size = 128 , action_size = action_size)

criterion = nn.CrossEntropyLoss()

optim = torch.optim.Adam(net.parameters() , lr = 1e-2)

writer = SummaryWriter(comment = "-cartpole")

for iter_num , batch in enumerate(dataloader(env , net , 16)):
    
    train_obs , train_act , threshold , rew_mean = filter_batch(batch , 70)
    
    optim.zero_grad()
    
    pred_action = net(train_obs)
    
    loss = criterion(pred_action , train_act)
    
    loss.backward()
    
    optim.step()
    
    print('Iteration  {} : Loss = {:.3f} , reward_mean = {:.3f} , threshold = {:.3f}'.format(iter_num , 
                                                                                             loss.item() ,
                                                                                             rew_mean , 
                                                                                             threshold))
    
    writer.add_scalar('loss' , loss.item() , iter_num)
    writer.add_scalar('reward_mean' , rew_mean , iter_num)
    writer.add_scalar('threshold' , threshold , iter_num)
    
    if rew_mean > 199:
        print('Cartpole solved!!')
        break

# Close the writer once, after the training loop has finished
writer.close()

Iteration  0 : Loss = 0.689 , reward_mean = 21.250 , threshold = 24.000
Iteration  1 : Loss = 0.678 , reward_mean = 28.438 , threshold = 33.500
Iteration  2 : Loss = 0.660 , reward_mean = 25.812 , threshold = 29.500
Iteration  3 : Loss = 0.660 , reward_mean = 34.938 , threshold = 41.500
Iteration  4 : Loss = 0.646 , reward_mean = 33.312 , threshold = 32.000
Iteration  5 : Loss = 0.653 , reward_mean = 35.875 , threshold = 45.500
Iteration  6 : Loss = 0.643 , reward_mean = 38.625 , threshold = 46.000
Iteration  7 : Loss = 0.631 , reward_mean = 52.062 , threshold = 58.000
Iteration  8 : Loss = 0.629 , reward_mean = 41.125 , threshold = 46.500
Iteration  9 : Loss = 0.612 , reward_mean = 54.500 , threshold = 64.500
Iteration  10 : Loss = 0.607 , reward_mean = 45.125 , threshold = 57.500
Iteration  11 : Loss = 0.614 , reward_mean = 45.062 , threshold = 50.500
Iteration  12 : Loss = 0.621 , reward_mean = 55.625 , threshold = 66.000
Iteration  13 : Loss = 0.596 , reward_mean = 61.125 , threshold = 69.500
Iteration  14 : Loss = 0.592 , reward_mean = 62.562 , thresh

Amazing. So we have solved the Cartpole balancing problem with Cross Entropy!!

Next up we will go into the details of Q-Learning.