# CartPole PyTorch
In this notebook we will discuss in detail the methods and code that go into ```cartpole_1.py```

In this notebook and its corresponding script we use the raw **PyTorch** framework, whereas other notebooks use the **FastAI** framework along with various convolutional neural network architectures for fine-tuned models.



In [5]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

from collections import namedtuple
from tensorboardX import SummaryWriter

In [None]:
# Our hidden layer will contain 128 neurons
HIDDEN_SIZE = 128

# We will have 16 episodes per batch
BATCH_SIZE = 16

# We will only pick top 30% of episodes
PERCENTILE = 70

In [None]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        """
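        This model takes a single observation from the environment as an input vector and outputs a number for every action we can perform.

        The output is a raw score per action, not yet a probability distribution. Softmax is applied afterwards: explicitly in iterate_batches, and internally by nn.CrossEntropyLoss during training.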
        
        ARGs:
            obs_size    : input dimensions
            hidden_size : hidden layer size
            n_actions   : output size
        """
        super(Net, self).__init__()
        
        # Our model
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )
        
    def forward(self, x):
        return self.net(x)
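
As a quick check of what the model returns, the sketch below pushes one made-up observation through a network with the same layout. The observation values and the sizes (4 inputs, 2 actions, as in CartPole) are assumptions chosen for illustration. It shows that the network's outputs are unnormalized scores and that softmax turns them into probabilities.

```python
import torch
import torch.nn as nn

# Same layout as Net above; sizes assume CartPole (4 observations, 2 actions)
net = nn.Sequential(
    nn.Linear(4, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)

# A made-up observation, shaped (1, 4): a batch containing one sample
obs_v = torch.tensor([[0.02, -0.01, 0.03, 0.04]])

with torch.no_grad():
    scores = net(obs_v)                   # raw scores, any real values
    probs = torch.softmax(scores, dim=1)  # non-negative, each row sums to 1

print(scores)
print(probs)
```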

In [None]:
# HELPER CLASSES

# This is a single episode stored as total undiscounted reward and a collection of EpisodeStep
Episode = namedtuple('Episode', field_names=['reward', 'steps'])

# This will be used to represent one single step that our agent made in the episode, and it stores: observation from the environment and what action the agent completed
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
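
To make the two helper classes concrete, here is a small sketch that builds one episode by hand. The observation values and actions are made up for illustration.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Two made-up steps: each pairs an observation with the action taken on it
steps = [
    EpisodeStep(observation=[0.0, 0.1, 0.0, -0.1], action=1),
    EpisodeStep(observation=[0.0, 0.3, 0.0, -0.4], action=0),
]

# CartPole gives +1 reward per step, so a two-step episode totals 2.0
episode = Episode(reward=2.0, steps=steps)

print(episode.reward)
print([step.action for step in episode.steps])
```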

In [None]:
# Function that generates batches
def iterate_batches(env, net, batch_size):
    """
    ARGs:
        env: (Env class, instance from Gym Library). This is our environment
        net: Our neural network
        batch_size: count of episodes per batch
    """
    # batch will store a list of Episode instances
    batch = []
    episode_reward = 0.0
    episode_steps = []
    
    # Initializing environment
    obs = env.reset()
    
    # softmax, will convert the network's output to a probability distribution of actions
    sm = nn.Softmax(dim=1)
    
    while True:
        """
        We will now take our observation and:
            1. convert to PyTorch Tensor = (1,4) vector
            2. run tensor through our NN and through our softmax to give us a distribution 
            3. Convert tensor into a NumPy array. 
        """
        obs_v = torch.Tensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        
        """
        We will now use our probability distribution of actions to obtain the actual action for the current step by: sampling this distribution using NumPy's function: random.choice().
        
        Then we will pass that action we choose to the environment to get our next observation, our reward, and indication whether episode is over or not. 
        """
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        
        """
        Now the reward is added to the current episode's total reward, and our list of episode steps is extended with an (observation, action) pair. 
        """
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        
        """
        The following is for handling when our current episode is over:
            1. Append finalized episode to batch, saving the total reward and steps taken
            2. Reset our total reward
            3. Clean the list of steps
            4. Reset our environment & start over
        """
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            
            # Checking if batch is full
            if len(batch) == batch_size:
                yield batch # return
                batch = []
                
        # Assigning the observation obtained from the environment to our current observation variable
        obs = next_obs
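
The sampling step inside the loop can be tried on its own. The sketch below uses made-up probabilities in place of the softmax output and draws many actions, to show that each action is picked roughly as often as its probability says. The fixed seed and the sample count are assumptions added only to make the sketch repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is repeatable

# Made-up action probabilities, standing in for the softmax output
act_probs = np.array([0.8, 0.2])

# Draw many actions the same way iterate_batches draws one per step
samples = np.random.choice(len(act_probs), size=10000, p=act_probs)

# Fraction of draws that picked action 0; it should land near 0.8
freq_0 = float(np.mean(samples == 0))
print(freq_0)
```

Sampling, rather than always taking the highest-probability action, is what lets the agent try different actions early on.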

In [None]:
def filter_batch(batch, percentile):
    """
    This function calculates a boundary reward, which is used to filter the elite episodes to train on. We use NumPy's percentile function: 
    
        1. It takes the list of values and the desired percentile
        2. It calculates the percentile's value
        3. Then we calculate the mean reward (used only for monitoring)
    """
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))
    
    """
    Now we will filter our episodes. For every episode in the batch, we check that its total reward is at least as high as our boundary, and if it is, we add its steps to the lists of observations and actions that we will train on.
    """
    train_obs = []
    train_act = []
    
    # Looping through batch
    for example in batch:
        # Skip episodes whose reward falls below our boundary
        if example.reward < reward_bound:
            continue
        
        # Elite episode: collect its observations
        train_obs.extend(map(lambda step: step.observation, example.steps))
        
        # Elite episode: collect the matching actions
        train_act.extend(map(lambda step: step.action, example.steps))
        
    """
    Now we will convert our observations and actions from elite episodes into tensors, and return a tuple of four:
        1. observation
        2. actions
        3. boundary of reward
        4. mean of reward
    """
    train_obs_v = torch.Tensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean
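
To see why a percentile of 70 keeps roughly the top 30% of episodes, the sketch below applies the same boundary test to ten made-up rewards. The reward values are assumptions chosen for illustration.

```python
import numpy as np

# Ten made-up episode rewards
rewards = [10, 12, 15, 18, 20, 22, 25, 30, 35, 40]

# The 70th percentile: about 70% of the rewards fall below this value
reward_bound = np.percentile(rewards, 70)

# Same test as filter_batch: keep episodes at or above the boundary
elite = [r for r in rewards if r >= reward_bound]

print(reward_bound)
print(elite)
```

Here three of the ten episodes survive the filter, and only their steps would be used for training.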

In [None]:
if __name__ == "__main__":
    # Creating our environment
    env = gym.make('CartPole-v0')
    # Wrapping with a monitor - so we can observe the agent
    env = gym.wrappers.Monitor(env, directory="Monitor", force=True)
    
    # Getting our input and output size for our NN
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n
    
    # Instantiating our NN
    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    
    # Loss function & Optimizer
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    
    # for tensorboardX
    writer = SummaryWriter()
    
    # Training
    """
    In this training loop: 
        1. We will iterate over our batches (lists of episodes)
        2. Perform filtering of the elite episodes using the filter_batch function
        3. Get our: variables of observations and taken action, reward boundary used for filtering and the mean reward
        4. Then we zero gradients of our network
        5. Pass observations to the network obtaining its action scores
        6. Pass scores to the objective (loss) function, calculating our cross-entropy 
        
    This loop will REINFORCE our network to carry out those 'elite' actions which have led to good rewards. 
    
        7. Calculate gradients on the loss
        8. Ask optimizer to adjust network .step()
    """
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        
        # This is very similar to training a NN for image classification: same steps 
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        
        # Printing & monitoring our progress
        print(f'{iter_no}: loss={loss_v.item()}, reward_mean={reward_m}, reward_bound={reward_b}')
        
        # for TensorBoard
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        
        """
        When the mean reward of the episodes in our batch becomes greater than 199, we stop our training. 
        
        CartPole environment is considered to be solved when the mean reward for last 100 episodes is greater than 195
        """
        if reward_m > 199:
            print('Solved!')
            break
            
    writer.close()

# Putting It All Together
So if you still have confusion, here is what is essentially happening when you run ```cart_pole_pytorch.py```

After initializing our neural network, optimizer, loss function, and environment, we enter into training at **inference** time. 

That is, we train our neural network on the episodes it generates while playing. As the agent navigates and plays the game, it is learning. *Arguably, this is closer to how an actual AGI might learn*. 

In the very beginning our agent's NN is initialized with random weights. Therefore it will perform badly and will end episodes very quickly. 

The first function that is called (in a loop) is ```iterate_batches()```, which yields one batch at a time. Since we wrap it in ```enumerate```, each iteration gives us an iteration number along with the batch. 
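
The generator-plus-```enumerate``` pattern can be sketched without any environment. In the example below, ```iterate_fake_batches``` is a hypothetical stand-in for ```iterate_batches()``` that yields lists of numbers instead of episodes, and the stop condition is an assumption standing in for the "solved" check.

```python
def iterate_fake_batches(batch_size):
    """Hypothetical stand-in for iterate_batches: yields lists of numbers forever."""
    batch = []
    counter = 0
    while True:
        batch.append(counter)
        counter += 1
        if len(batch) == batch_size:
            yield batch   # hand the full batch to the caller, then resume here
            batch = []

seen = []
for iter_no, batch in enumerate(iterate_fake_batches(batch_size=4)):
    seen.append((iter_no, batch))
    if iter_no == 2:      # stand-in for the "solved" condition
        break

print(seen)
```

Because the generator never ends on its own, the caller decides when to stop, which is what the ```break``` in the training loop does.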

In our example a batch is composed of **16 episodes**. Remember an episode is essentially an entire game played. 

We then take this batch (we process one batch at a time) and run it through ```filter_batch()```, which in our case returns the **top 30%** of episodes (their observations and associated actions). 

We will then use what ```filter_batch()``` returns: ```obs_v, acts_v, reward_b, reward_m```. Most importantly, we use ```obs_v``` as the input to our neural network and ```acts_v``` as the **true labels** for that input. 

So now that we have these variables, we first run ```obs_v``` through our (at first untrained) neural network to get a score for each action. 

Running ```obs_v``` through the network returns ```action_scores_v```: the raw, pre-softmax scores for the actions to take. 

We then call ```objective(action_scores_v, acts_v)```, where ```objective``` is our ```nn.CrossEntropyLoss()``` instance, to compare the **predicted** ```action_scores_v``` against the **actual** ```acts_v``` and calculate our loss. The softmax is applied inside the loss function. 
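
To illustrate what the loss computes, the sketch below feeds made-up scores and actions to ```nn.CrossEntropyLoss``` and reproduces the same number by hand. The tensors are assumptions chosen for illustration, not values from a real run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for 3 observations and 2 actions
action_scores_v = torch.tensor([[2.0, 0.5], [0.1, 1.2], [1.0, 1.0]])
# Made-up actions taken in the elite episodes (the "labels")
acts_v = torch.tensor([0, 1, 0])

objective = nn.CrossEntropyLoss()
loss_v = objective(action_scores_v, acts_v)

# The same value by hand: log-softmax, pick the taken action, negate, average
log_probs = F.log_softmax(action_scores_v, dim=1)
manual = -log_probs[torch.arange(3), acts_v].mean()

print(loss_v.item(), manual.item())
```

The loss is small when the network assigns high probability to the actions that the elite episodes actually took.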

With our loss calculated we will do (as you usually do for NN training):
* calculate gradients: ```loss_v.backward()```
* call optimizer step: ```optimizer.step()```

Then we repeat! 
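
A minimal sketch of that repeated step is shown below. It assumes a smaller network and a fixed, made-up batch of "elite" data. In the real script every iteration gets a fresh batch from the environment, so this only illustrates the mechanics of the update.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)

# Made-up "elite" data: 8 observations and the actions taken on them
obs_v = torch.randn(8, 4)
acts_v = torch.tensor([0, 1, 0, 1, 1, 0, 1, 0])

losses = []
for _ in range(100):
    optimizer.zero_grad()                    # clear old gradients
    loss_v = objective(net(obs_v), acts_v)   # compare scores with actions
    loss_v.backward()                        # compute gradients
    optimizer.step()                         # adjust the weights
    losses.append(loss_v.item())

print(losses[0], losses[-1])
```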

What happens is: every iteration processes one batch of **16 episodes**. As the loss goes down, the agent lasts longer, and we keep recording its observations and actions. 

With gradient descent it will eventually learn the steps to take in order to achieve the goal! 