# 3. Challenge - REINFORCE

![cartpole](https://gym.openai.com/videos/2019-04-06--My9IiAbqha/CartPole-v1/poster.jpg)

In this notebook, we're going to build a REINFORCE agent that learns to play the CartPole game. OpenAI (famously funded by Elon Musk) provides a package called `gym` with a variety of environments in which we can train our agents.

### Basic commands

This might be your first interaction with the OpenAI Gym library, so here are some commands you are likely to need throughout both parts of the challenge.

In [None]:
def sample_commands():
    
    #Creating cartpole environment
    env = gym.make('CartPole-v0')
    
    #Visualize actions
    env.render()
    
    #Take an action and extract the resulting information
    observation, reward, done, info = env.step(action)
    
    #Close window
    env.close()
    
    #Extracting the number of possible actions
    env.action_space

### Implementation

Let's start building our agent by importing the required libraries and specifying parameters. You may play around with some of the values (gamma, learning rate, etc.) to see the effect they have on training.

In [None]:
import gym
import numpy as np
import tensorflow as tf
import keras
from keras.layers import Input, Dense
from keras.models import Sequential

In [None]:
#####------Configuration parameters----------------------################

seed = 42
# Discount factor for future rewards
gamma = ___
learning_rate = 0.01
max_steps_per_episode = 10000

In [None]:
#######------Setting up environment----------------------################

#Select Cartpole environment
env = ___
env.seed(seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0

### Define a model
Let's first build a model. It's a simple neural network that takes a state as input and outputs a probability for each action, which is our policy. And that's it! You'll soon see how we can train this simple model to choose good actions in each state.

In [None]:
num_states = env.observation_space.shape[0]
num_hiddens1 = 32
num_hiddens2 = 32
num_actions = ___

# define a model
inputs = Input(shape=(num_states,))
fc1 = Dense(num_hiddens1, activation='relu')(inputs)
fc2 = Dense(num_hiddens2, activation='relu')(fc1)
outputs = Dense(num_actions, activation='softmax')(fc2)

model = keras.Model(inputs=inputs, outputs=outputs)
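
The softmax output layer is what makes the network a stochastic policy: it turns raw scores into probabilities that sum to 1, and the agent then samples an action from them. The following is a minimal sketch in plain Python, independent of Keras; the logits `[2.0, 0.0]` and the helper names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the policy probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
probs = softmax([2.0, 0.0])  # hypothetical scores for two actions
counts = [0, 0]
for _ in range(1000):
    counts[sample_action(probs, rng)] += 1
print(probs, counts)
```

Because actions are sampled rather than chosen greedily, the less likely action is still tried occasionally, which is how a policy-gradient agent explores.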

### Define a training process

REINFORCE is a Policy Gradient algorithm: it searches for the policy parameters $\theta$ that maximise the objective function:

$$
J(\theta) = \sum_\tau P(\tau; \theta)R(\tau)
$$

where $P(\tau; \theta)$ is the probability of observing the trajectory $\tau$ when following the policy $\pi_\theta$, and $R(\tau)$ is the total reward collected along it. Since it's nearly impossible to sum over every possible trajectory, especially in a continuous state space, we'd like to approximate the objective function by sampling. In a nutshell, the objective is the expected return under a given policy.
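
To make the objective concrete, the sketch below enumerates every trajectory of a tiny, hypothetical three-step problem to compute $J(\theta) = \sum_\tau P(\tau; \theta)R(\tau)$ exactly, then approximates the same quantity by averaging sampled returns. The policy probability, horizon, and reward rule are all made-up assumptions chosen so that the exact sum is small enough to write out.

```python
import itertools
import random

P_ACTION_1 = 0.7  # hypothetical policy: pick action 1 with this probability
HORIZON = 3       # hypothetical episode length


def trajectory_prob(actions):
    """P(tau): product of the per-step action probabilities."""
    prob = 1.0
    for a in actions:
        prob *= P_ACTION_1 if a == 1 else 1.0 - P_ACTION_1
    return prob


def trajectory_return(actions):
    """R(tau): toy reward of +1 each time action 1 is taken."""
    return float(sum(actions))


def exact_objective():
    """Sum P(tau) * R(tau) over all 2**HORIZON trajectories."""
    return sum(trajectory_prob(t) * trajectory_return(t)
               for t in itertools.product((0, 1), repeat=HORIZON))


def sampled_objective(num_episodes, seed=0):
    """Monte Carlo estimate: average return over sampled episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        actions = [1 if rng.random() < P_ACTION_1 else 0 for _ in range(HORIZON)]
        total += trajectory_return(actions)
    return total / num_episodes


print(exact_objective(), sampled_objective(20000))
```

In CartPole the exact sum is out of reach, so only the sampled version is available; that is the estimate REINFORCE relies on.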

The gradient that we will use to update our model is

$$
\begin{split}
\nabla J(\theta)
&= \mathbb{E}_{\pi} [Q_{\pi}(s,a) \nabla_{\theta} \ln{\pi_\theta (a|s)}] \\
&= \mathbb{E}_{\pi} [G_t \nabla_{\theta} \ln{\pi_\theta (a|s)}]
\end{split}
$$

according to the [Policy Gradient Theorem](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#policy-gradient-theorem).
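
As a rough sketch of where this gradient comes from (not a full proof), we can differentiate the trajectory form of the objective and apply the identity $\nabla_\theta P = P \, \nabla_\theta \ln P$:

$$
\begin{split}
\nabla_\theta J(\theta)
&= \sum_\tau \nabla_\theta P(\tau; \theta) R(\tau) \\
&= \sum_\tau P(\tau; \theta) \nabla_\theta \ln P(\tau; \theta) R(\tau) \\
&= \mathbb{E}_{\tau \sim \pi_\theta} \Big[ R(\tau) \sum_t \nabla_\theta \ln \pi_\theta(a_t|s_t) \Big]
\end{split}
$$

The last line assumes that the trajectory probability factorises as $P(\tau; \theta) = p(s_0) \prod_t \pi_\theta(a_t|s_t) \, p(s_{t+1}|s_t, a_t)$, so the environment dynamics do not depend on $\theta$ and drop out of the gradient. Because an action cannot influence rewards received before it was taken, $R(\tau)$ can then be replaced by the return from timestep $t$ onwards, $G_t$, which gives the form used above.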

To train a REINFORCE agent, we follow these steps:

1. Initialize the policy with random values
2. Sample states, actions, and rewards from one episode
3. Update the policy using the gradient of the objective function $J(\theta)$
4. Repeat steps 2 and 3 until the policy converges
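
The four steps above can be sketched on a problem much smaller than CartPole. The toy below is an assumption made purely for illustration: episodes last a single step, the policy has one parameter `theta`, and only action 1 is rewarded, so the return $G$ is just the reward. It uses plain Python rather than TensorFlow so the update rule is visible.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_reinforce(num_episodes=1000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the policy parameter with a small random value.
    theta = rng.uniform(-0.1, 0.1)
    for _ in range(num_episodes):
        # Step 2: sample one (one-step) episode from the current policy.
        p_action_1 = sigmoid(theta)  # pi_theta(a = 1)
        action = 1 if rng.random() < p_action_1 else 0
        ret = 1.0 if action == 1 else 0.0  # toy reward: only action 1 pays off
        # Step 3: gradient ascent on G * grad ln pi_theta(a).
        # For this Bernoulli policy, d/dtheta ln pi_theta(a) = a - p_action_1.
        grad_log_prob = action - p_action_1
        theta += learning_rate * ret * grad_log_prob
        # Step 4: the loop repeats steps 2 and 3.
    return sigmoid(theta)  # final probability of choosing the rewarded action


print(run_reinforce())
```

The notebook's training loop does the same thing, except that the gradient of the log-probability is computed by `tf.GradientTape` and each episode contributes many timesteps instead of one.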

We assume that the returns sampled in each episode under the current policy are a reasonable estimate of the objective function. Any single episode gives a noisy estimate, but over thousands of episodes the updates should, on balance, move the model in the right direction. This is the same idea as stochastic gradient descent, which trains most deep learning models on noisy minibatch estimates of the true gradient.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
action_probs_history = []  # probabilities of the selected actions for each time step
rewards_history = []  # rewards for each time step
running_reward = 0
episode_count = 0

while True:
    state = env.reset()  # reset the environment 
    episode_reward = 0
    
    # tf.GradientTape() lets us compute the gradients of the loss with respect to the model weights
    with tf.GradientTape() as tape: 
        for timestep in range(1, max_steps_per_episode):
            env.render()
            
            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, axis=0)
            
            #Using model to output policy
            policy = ___
            
            action = np.random.choice(num_actions, 1, p=np.squeeze(policy))[0]
            action_prob = policy[0, action]  # probability of the action taken
            
            #Extracting environment parameters after the action
            state, reward, done, _ = ___
            
            # collect samples
            rewards_history.append(reward)  # reward received at this time step
            action_probs_history.append(action_prob)  # probability of the action taken
            
            episode_reward += reward
            
            if done:
                break
                
        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward
        
        # Calculate the discounted return for each timestep
        # - At each timestep, what was the total reward received from that timestep onwards?
        # - Rewards further in the future are discounted by multiplying them with gamma
        # - These returns weight the log-probabilities in the policy-gradient loss
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)
            
        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # compute the loss
        
        action_probs_history = tf.convert_to_tensor(action_probs_history)
        cross_entropy = -tf.math.log(action_probs_history + 1e-6)
        loss = tf.reduce_sum(returns * cross_entropy)

        gradients = tape.gradient(loss, model.trainable_variables)  # compute the gradients of loss w.r.t model parameters
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))  # update the parameters accordingly
        action_probs_history, rewards_history = [], []  # clear the samples
        
    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
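
One consequence of the smoothing factor 0.05 is that `running_reward` lags behind the agent's actual performance. The sketch below estimates that lag under a strong assumption: every episode scores 200, which is the per-episode cap in `CartPole-v0`. The function name and its default arguments are illustrative.

```python
def episodes_until_solved(episode_reward=200.0, threshold=195.0, alpha=0.05):
    """Count updates until the exponential moving average exceeds the threshold."""
    if episode_reward <= threshold:
        raise ValueError("the running reward can never exceed the threshold")
    running_reward = 0.0
    episodes = 0
    while running_reward <= threshold:
        # Same update rule as in the training loop above.
        running_reward = alpha * episode_reward + (1 - alpha) * running_reward
        episodes += 1
    return episodes


print(episodes_until_solved())
```

Under these assumptions, even a perfect agent needs 72 consecutive episodes before the solved condition triggers, so the reported episode count overstates how long learning itself took.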