In [0]:
#http://inoryy.com/post/tensorflow2-deep-reinforcement-learning/

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

# TensorFlow and tf.keras
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np

TensorFlow 2.x selected.


In [0]:
import gym
import logging
import matplotlib.pyplot as plt
import tensorflow.keras.layers as kl
import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko

#Reinforcement Learning theory
Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, receiving rewards as a result. Most RL algorithms work by maximizing the expected total rewards an agent collects in a trajectory, e.g., during one in-game round.

The output of an RL algorithm is a policy – a function from states to actions.
A valid policy can be as simple as a hard-coded no-op action, but typically it represents a conditional probability distribution of actions given some state.
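
As a minimal sketch of what "a conditional probability distribution of actions given some state" can look like in code, the snippet below turns a vector of action scores (logits) into probabilities with a softmax and samples an action from them. The function names and the example logits are illustrative assumptions, not part of any library.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)        # fixed seed so the sketch is reproducible
example_logits = [2.0, 0.0]   # hypothetical scores for two actions
probs = softmax(example_logits)
actions = [sample_action(example_logits, rng) for _ in range(1000)]
frac_action0 = actions.count(0) / len(actions)
```

The policy is stochastic: action 0 is preferred, but action 1 is still chosen some of the time, which is what lets the agent explore.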

RL algorithms are often grouped based on their optimization loss function.

<ul>
  <li>
  Temporal-Difference methods, such as Q-Learning, reduce the error between predicted and actual state(-action) values.
  </li>
    <li>
  Policy Gradients directly optimize the policy by adjusting its parameters. Computing the gradients exactly is usually infeasible; instead, they are often estimated via Monte Carlo methods.
  </li>
  <li>
  The most popular approach is a hybrid of the two: actor-critic methods, where policy gradients optimize the agent's policy, and the temporal-difference method is used as a bootstrap for the expected value estimates.
  </li>



</ul>
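
To make the temporal-difference idea from the list above concrete, here is a small sketch of a single TD(0) update on a table of state values. The state names, step size `alpha` and the `td0_update` helper are assumptions chosen for illustration only.

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(state) toward the bootstrapped target."""
    # At the end of an episode there is no future value to bootstrap from.
    bootstrap = 0.0 if done else values[next_state]
    target = reward + gamma * bootstrap
    td_error = target - values[state]   # predicted vs. bootstrapped value
    values[state] += alpha * td_error
    return td_error

values = {"A": 0.0, "B": 1.0}           # hypothetical value table
err = td0_update(values, "A", reward=1.0, next_state="B", done=False)
```

The "bootstrap" is the use of the current estimate `values[next_state]` inside the target, which is the role the critic plays in actor-critic methods.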

#Deep Reinforcement Learning theory

An RL algorithm is considered deep if the policy and value functions are approximated with neural networks.

<ul>
  <li>
  (Asynchronous) Advantage Actor-Critic:

  1. First, gradients are weighted with returns: a discounted sum of future rewards. Discounting resolves theoretical issues with infinite timesteps and mitigates the "credit assignment problem", that is, attributing rewards to the actions that actually caused them.
  2. An advantage function is used instead of raw returns. Advantage is formed as the difference between returns and some baseline, which is often the value estimate;
```
 Advantage = returns - baseline
```
and can be thought of as a measure of how good a given action is compared to some average.
  
  3. An additional entropy maximization term is used in the objective function to ensure the agent sufficiently explores various policies. In essence, entropy measures how random a given probability distribution is. (For example, entropy is highest for the uniform distribution.)

  4. Finally, multiple workers are used in parallel to speed up sample gathering while helping decorrelate the samples during training, diversifying the experiences an agent trains on in a given batch.
  </li>
</ul>
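
The entropy and advantage terms from the list above can be checked with a few lines of plain Python. This is only a sketch; the helper names and the example numbers are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def advantages(returns, baselines):
    """Advantage = returns - baseline, element by element."""
    return [g - b for g, b in zip(returns, baselines)]

uniform = entropy([0.5, 0.5])         # most random two-action policy
skewed = entropy([0.9, 0.1])          # mostly picks the first action
deterministic = entropy([1.0, 0.0])   # always picks the first action

# Hypothetical returns and value-estimate baselines for two steps.
adv = advantages([2.0, 1.0], [1.5, 1.5])
```

The uniform distribution has the largest entropy and the deterministic one has zero, so rewarding entropy pushes the policy away from collapsing onto a single action too early. A positive advantage means the action did better than the baseline expected, a negative one means it did worse.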

#Advantage Actor-Critic With TensorFlow 2.1

Now that we are more or less on the same page, let's see what it takes to implement the basis of many modern DRL algorithms: the actor-critic agent described in the previous section. For simplicity we leave out the parallel workers, though most of the code would be the same with them.

We use the CartPole-v0 environment as a testbed.

#First, we create the policy and value estimate NNs under a single model class:

In [0]:
class ProbabilityDistribution(tf.keras.Model):
  def call(self, logits, **kwargs):
    # Sample a random categorical action from the given logits.
    return tf.squeeze(tf.random.categorical(logits,1), axis = -1)
    # tf.squeeze: https://www.tensorflow.org/api_docs/python/tf/squeeze
    #   With axis=-1 it removes the last dimension, which must have size 1.
    # tf.random.categorical: https://www.tensorflow.org/api_docs/python/tf/random/categorical
    #   Takes unnormalized log-probabilities (logits) of shape [batch_size, num_classes]
    #   and returns drawn samples of shape [batch_size, num_samples].

In [0]:
class Model(tf.keras.Model): #what the model is: https://www.tensorflow.org/api_docs/python/tf/keras/Model
  # Subclass tf.keras.Model and override __init__() and call().
  def __init__(self, num_actions):
    super().__init__("mlp_policy") # "mlp_policy" is meant to be the model's name
    self.hidden1 = kl.Dense(128, activation = 'relu') # layers are only created here; call() wires them together
    self.hidden2 = kl.Dense(64, activation= 'relu')
    self.value = kl.Dense(1, name = "value") #the 1 is the dimensionality of the output
    # Logits are unnormalized log probabilities.
    self.logits = kl.Dense(num_actions, name = "policy_logits")
    self.dist = ProbabilityDistribution()
  
  def call(self,inputs, **kwargs):
    # Inputs is a numpy array, convert to a tensor.
    print("Call start ","inputs: ", inputs)
    x = tf.convert_to_tensor(inputs)
    print("x converted to tensor: ", x)
    hidden_logs = self.hidden1(x) # hidden1 and hidden2 are two separate layers with independent weights
    print("got hidden logs, ", hidden_logs)
    hidden_vals = self.hidden2(x)
    print("got hidden vals, ", hidden_vals)
    print("The logits are: ", self.logits(hidden_logs))
    print("The values are: ", self.value(hidden_vals))

    return self.logits(hidden_logs), self.value(hidden_vals)
  
  def action_value(self,obs): # we only pass in obs; the layers are already defined on self
    # Executes `call()` under the hood.
    print("Action")
    logits, value = self.predict_on_batch(obs) 
    print("Back to action_value")
    print("After predict_on_batch: got logits", logits)
    print("After predict_on_batch: got value", value)
    action = self.dist.predict_on_batch(logits) # dist is the ProbabilityDistribution; it samples an action from the logits
    return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

And verify the model works as expected:
(plus some experiments)

In [0]:
#simply making the environment
env = gym.make('CartPole-v0')

In [7]:
#gets the action space from the environment
print(env.action_space)
print(env.action_space.n)

Discrete(2)
2


In [8]:
#instantiate the model we defined above
model = Model(num_actions=env.action_space.n)
model

<__main__.Model at 0x7f9d185d7e48>

In [9]:
#Gets the observation
obs = env.reset()
print(obs.shape)
obs

(4,)


array([-0.02788602,  0.00177448,  0.04136009, -0.0131044 ])

In [10]:
print(obs[None, :].shape) # this has shape (1, 4) instead of (4,)
# Indexing with None adds a leading axis, so the array prints with double brackets.
obs[None] # adds the batch dimension the model expects
# Note: obs[None] is the same as obs[None, :]

(1, 4)


array([[-0.02788602,  0.00177448,  0.04136009, -0.0131044 ]])
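
A short NumPy sketch of the batch-dimension trick used above. It reuses the observation printed earlier; the variable names are assumptions. Note that `obs[None:]` (without the comma) is an ordinary slice and does not add an axis.

```python
import numpy as np

obs = np.array([-0.02788602, 0.00177448, 0.04136009, -0.0131044])

batched_a = obs[None]                     # new leading axis: shape (1, 4)
batched_b = obs[None, :]                  # same thing, written explicitly
batched_c = np.expand_dims(obs, axis=0)   # equivalent, more verbose spelling

not_batched = obs[None:]                  # plain slice from the start: still shape (4,)
```

The model expects a batch of observations, so a single observation of shape `(4,)` is wrapped into a batch of one, shape `(1, 4)`.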

What happens when you call model.action_value:

1. action_value starts executing; its first step is self.predict_on_batch(), which hands control to call().

2. The whole of call() runs before action_value continues.
<ul>
  <li>
    1. The input is converted to a tensor. Note that call() runs twice here, with two different inputs: one of shape (None, 4) and one of shape (1, 4). The (1, 4) input is our obs[None, :]. The (None, 4) input is a symbolic placeholder that Keras uses to build the model on first use: None stands for an unknown batch size and 4 is the size of a CartPole observation. Both passes print symbolic Tensor objects rather than numbers, so they are graph-tracing passes, not extra predictions.
  </li>
    <li>
    2. The (None, 4) input, already converted, is passed through the hidden1 and hidden2 layers. One branch returns the logits, shape (None, 2), and the other returns the values, shape (None, 1). The shapes come out this way because in the class the hidden1 branch ends in a Dense layer with num_actions outputs, and the hidden2 branch ends in Dense(1).
  </li>
  <li>
    3. call() returns these tensors (it ends with a return statement), but I do not know where they are used afterwards; presumably Keras only needs them to build the layers and record the output shapes.
  </li>
  <li>
    4. call() then starts over, this time with an input of shape (1, 4). It goes through the same steps, but now the Dense outputs have shapes (1, 2) and (1, 1): the same as before, with a concrete batch size of 1 in place of None.
  </li>
</ul>

3. Now we return to action_value()
<ul>
  <li>
    1. It uses predict_on_batch with obs (that is, obs[None, :]) as input. I think that because the model has two output heads, predict_on_batch returns two values, the logits and the value. (I am going to run an experiment in which I modify a model class to see whether I can get odd behaviour out of it: more outputs, a single input, and so on.)
  </li>
</ul>

In [11]:
#returns the action we take (one of the two available), simply either 1 or 0
#and the value: the value head's estimate of the expected return from this state
action, value = model.action_value(obs[None, :])
print(action, value) # [1] [-0.00145713]

Action
Call start  inputs:  Tensor("input_1:0", shape=(None, 4), dtype=float32)
x converted to tensor:  Tensor("input_1:0", shape=(None, 4), dtype=float32)
got hidden logs,  Tensor("model/dense/Identity:0", shape=(None, 128), dtype=float32)
got hidden vals,  Tensor("model/dense_1/Identity:0", shape=(None, 64), dtype=float32)
The logits are:  Tensor("model/policy_logits/Identity:0", shape=(None, 2), dtype=float32)
The values are:  Tensor("model/value/Identity:0", shape=(None, 1), dtype=float32)
Call start  inputs:  Tensor("self:0", shape=(1, 4), dtype=float32)
x converted to tensor:  Tensor("self:0", shape=(1, 4), dtype=float32)
got hidden logs,  Tensor("model/dense/Relu:0", shape=(1, 128), dtype=float32)
got hidden vals,  Tensor("model/dense_1/Relu:0", shape=(1, 64), dtype=float32)
The logits are:  Tensor("model/policy_logits/BiasAdd:0", shape=(1, 2), dtype=float32)
The values are:  Tensor("model/value/BiasAdd:0", shape=(1, 1), dtype=float32)
Back to action_value
After predict_on_batch

#This is a space for experimentation, in which I will modify a model class.

I reduced the following model to a single network with a single output, so action_value gives me a single value.

In [0]:
class ex2Model(tf.keras.Model): #what the model is: https://www.tensorflow.org/api_docs/python/tf/keras/Model
  # Subclass tf.keras.Model and override __init__() and call().
  def __init__(self, num_actions):
    super().__init__() # no name is passed, so Keras picks a default one
    print("__init__ started")
    self.hidden1 = kl.Dense(128, activation = 'relu') # layers are only created here; call() wires them together
    self.value = kl.Dense(1, name = "value") #the 1 is the dimensionality of the output
    print("__init__ finished")

  def call(self,inputs, **kwargs): # call() is the method you override to define the forward pass
    # Inputs is a numpy array, convert to a tensor.
    print("call inputs: ", inputs)
    x = tf.convert_to_tensor(inputs)
    print("Converted inputs to tensor ", x)
    xThroughHidden1 = self.hidden1(x) # a single hidden layer this time
    print("passed tensorized inputs through the hidden layer ",xThroughHidden1)
    print("pass the results of the hidden layer 1, through a dense layer with output 1: ", self.value(xThroughHidden1) )
    return self.value(xThroughHidden1) 
  
  def action_value(self,obs): # we only pass in obs; the layers are already defined on self
    # Executes `call()` under the hood.
    print("Enter action_value")
    value = self.predict_on_batch(obs)  #predict on batch, basically runs call()
    print("The value before squeeze: ", value.shape) 
    return np.squeeze(value, axis=-1)

In [13]:
Emodel = ex2Model(num_actions=env.action_space.n)
enterData = np.zeros((4,22,8))
print(enterData.shape)
print(Emodel.action_value(enterData))

__init__ started
__init__ finished
(4, 22, 8)
Enter action_value
call inputs:  Tensor("input_1_2:0", shape=(None, 22, 8), dtype=float32)
Converted inputs to tensor  Tensor("input_1_2:0", shape=(None, 22, 8), dtype=float32)
passed tensorized inputs through the hidden layer  Tensor("ex2_model/dense_2/Identity:0", shape=(None, 22, 128), dtype=float32)
pass the results of the hidden layer 1, through a dense layer with output 1:  Tensor("ex2_model/value/Identity:0", shape=(None, 22, 1), dtype=float32)
call inputs:  Tensor("self:0", shape=(4, 22, 8), dtype=float32)
Converted inputs to tensor  Tensor("self:0", shape=(4, 22, 8), dtype=float32)
passed tensorized inputs through the hidden layer  Tensor("ex2_model/dense_2/Relu:0", shape=(4, 22, 128), dtype=float32)
pass the results of the hidden layer 1, through a dense layer with output 1:  Tensor("ex2_model/value/BiasAdd:0", shape=(4, 22, 1), dtype=float32)
The value before squeeze:  (4, 22, 1)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

A small but interesting revelation: whatever the input shape, call() runs twice the first time the model is used. If your shape is (Y, N, Z), it runs once with shape (None, N, Z) and once with (Y, N, Z).

BUT the first pass does not return "a value"; it returns a tensor of shape (None, N, DenseOutputShape) with "nothing inside", that is, a symbolic tensor. The second pass, on the other hand, gives you a result of shape (Y, N, DenseOutputShape). I find this interesting because you could flatten it to Y*N*DenseOutputShape elements and have everything one-dimensional.
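
The shapes seen in the experiment can be reproduced with plain NumPy, because a Dense layer applied to an N-dimensional input only transforms the last axis. This is a sketch with made-up weights (all ones and zeros), not the Keras implementation.

```python
import numpy as np

x = np.zeros((4, 22, 8))                 # same shape as enterData above

# Stand-ins for Dense(128, activation='relu'): weights (8, 128), bias (128,)
W_hidden, b_hidden = np.ones((8, 128)), np.zeros(128)
hidden = np.maximum(x @ W_hidden + b_hidden, 0.0)   # shape (4, 22, 128)

# Stand-ins for Dense(1): weights (128, 1), bias (1,)
W_value, b_value = np.ones((128, 1)), np.zeros(1)
value = hidden @ W_value + b_value       # shape (4, 22, 1)

squeezed = np.squeeze(value, axis=-1)    # shape (4, 22), like action_value
flat = value.reshape(-1)                 # shape (88,) = 4 * 22 * 1
```

Only the last dimension changes (8, then 128, then 1); the leading `(4, 22)` dimensions are carried through untouched, and `reshape(-1)` is one way to get the one-dimensional view mentioned above.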

#Back to the original function

Basically, call() returns two outputs for the same input, one of shape num_actions and one of shape 1.

After the first predict_on_batch there is a second one, but this one is called on .dist (which is a ProbabilityDistribution).

Note that the second predict_on_batch does not go through Model.call(), because none of its print statements fire; it runs ProbabilityDistribution.call() instead.

In [0]:
class Model(tf.keras.Model): #what the model is: https://www.tensorflow.org/api_docs/python/tf/keras/Model
  # Subclass tf.keras.Model and override __init__() and call().
  def __init__(self, num_actions):
    super().__init__("mlp_policy") # "mlp_policy" is meant to be the model's name
    self.hidden1 = kl.Dense(128, activation = 'relu') # layers are only created here; call() wires them together
    self.hidden2 = kl.Dense(64, activation= 'relu')
    self.value = kl.Dense(1, name = "value") #the 1 is the dimensionality of the output
    # Logits are unnormalized log probabilities.
    self.logits = kl.Dense(num_actions, name = "policy_logits")
    self.dist = ProbabilityDistribution()
  
  def call(self,inputs, **kwargs):
    # Inputs is a numpy array, convert to a tensor.
    #print("Call start ","inputs: ", inputs)
    x = tf.convert_to_tensor(inputs)
    #print("x converted to tensor: ", x)
    hidden_logs = self.hidden1(x) # hidden1 and hidden2 are two separate layers with independent weights
    #print("got hidden logs, ", hidden_logs)
    hidden_vals = self.hidden2(x)
    #print("got hidden vals, ", hidden_vals)
    #print("The logits are: ", self.logits(hidden_logs))
    #print("The values are: ", self.value(hidden_vals))

    return self.logits(hidden_logs), self.value(hidden_vals)
  
  def action_value(self,obs): # we only pass in obs; the layers are already defined on self
    # Executes `call()` under the hood.
    #print("Action")
    logits, value = self.predict_on_batch(obs) 
    #print("Back to action_value")
    #print("After first predict_on_batch: got logits", logits)
    #print("After first predict_on_batch: got value", value) 
    # Note that the second predict_on_batch does not go through Model.call().
    # We are asking for predict_on_batch on the distribution instead.
    action = self.dist.predict_on_batch(logits) # dist is the ProbabilityDistribution; it samples an action from the logits
    #print("After second predict on batch, the action array is: ", action)
    return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

In [15]:
env = gym.make('CartPole-v0')
model = Model(num_actions=env.action_space.n)

obs = env.reset()
print(obs.shape)
# No feed_dict or tf.Session() needed at all!
action, value = model.action_value(obs[None, :])
print(action, value)

(4,)
1 [0.00621449]


#How the class works:
1. The class subclasses tf.keras.Model, and from there you override whatever is necessary.

2. In __init__():
<ul>
  <li>
  you declare that the model must be given num_actions when it is initialized
  </li>
  <li>
    You also define the layers the class will use. Here we define two independent hidden layers, hidden1 and hidden2 (with independent weights), and two output layers, value and logits, as well as a ProbabilityDistribution() model.
  </li>
</ul> 

3. In call(): running self.predict_on_batch() runs call(). What call() does is:
<ul>
  <li>
  Take the array we give it as input and convert it to a tensor. It then passes this input (independently) through each hidden layer. (hidden1 and hidden2 receive the same input but produce different values.) The values coming out of hidden1 go to logits (a Dense layer with num_actions outputs) and the values coming out of hidden2 go to value (a Dense layer with 1 output).
  </li>
  <li>
    Note: call() runs twice the first time you call the model. The first run uses a shape of (None, N) instead of your original shape; the second uses your original shape. (Later calls do not repeat the (None) pass.)
  </li>
</ul> 

4. Finally there is action_value, which first runs call() (through predict_on_batch) and then applies the probability distribution to the logits returned by call() to sample which action to take.
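
The data flow described above can be mimicked in plain NumPy to check the shapes at each step. This is only a sketch of the structure: the weights are random stand-ins (an assumption), not the Keras layers, and the helper names mirror the class methods for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, num_actions = 4, 2

# Hypothetical random weights standing in for the four Dense layers.
W_h1, b_h1 = rng.normal(size=(obs_dim, 128)), np.zeros(128)
W_h2, b_h2 = rng.normal(size=(obs_dim, 64)), np.zeros(64)
W_logits, b_logits = rng.normal(size=(128, num_actions)), np.zeros(num_actions)
W_value, b_value = rng.normal(size=(64, 1)), np.zeros(1)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x):
    """Mirror of call(): the same input goes through two separate branches."""
    hidden_logs = relu(x @ W_h1 + b_h1)      # (batch, 128)
    hidden_vals = relu(x @ W_h2 + b_h2)      # (batch, 64)
    logits = hidden_logs @ W_logits + b_logits   # (batch, num_actions)
    value = hidden_vals @ W_value + b_value      # (batch, 1)
    return logits, value

def action_value(x):
    """Mirror of action_value(): sample an action from the logits."""
    logits, value = forward(x)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    action = np.array([rng.choice(num_actions, p=p) for p in probs])
    return action, np.squeeze(value, axis=-1)

x = np.array([[-0.02788602, 0.00177448, 0.04136009, -0.0131044]])
logits, value = forward(x)
action, v = action_value(x)
```

With a batch of one observation, `logits` has shape `(1, 2)` and `value` has shape `(1, 1)`, matching the shapes printed by the real model.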

#Agent Interface

Now we can move on to the fun stuff – the agent class. First, we add a test method that runs through a full episode, keeping track of the rewards.

In [0]:
class A2CAgent:
  def __init__(self, model): # you literally just pass in the model we built earlier
    self.model = model

  def test(self, env, render=True):
    obs, done, ep_reward = env.reset(), False, 0 #Get the initial values from the environment
    while not done:
      action, _ = self.model.action_value(obs[None, :]) # returns the sampled action and the value estimate (unused here)
      obs, reward, done, _ = env.step(action)
      ep_reward += reward #get the reward of the action
      #if render: #I think this is for visualization purposes.
       # env.render()
    return ep_reward

Now we can check how much the agent scores with randomly initialized weights:

In [0]:
model = Model(num_actions=env.action_space.n)

In [18]:
agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum)

16 out of 200


#The training

#Loss / Objective Function and The training loop
An agent improves its policy through gradient descent based on some loss (objective) function. In the A2C algorithm, we train on three objectives: improve the policy with advantage-weighted gradients, maximize the entropy, and minimize value estimate errors.
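
As a rough NumPy sketch of how the three objectives combine into one number, assuming plain batch averages (Keras' exact reduction may differ) and the same default coefficients used by the agent class in this notebook. The function names and the tiny batch are made up for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def a2c_loss(logits, values, actions, advantages, returns,
             value_c=0.5, entropy_c=1e-4):
    log_probs = log_softmax(logits)
    probs = np.exp(log_probs)
    # 1) Policy: cross-entropy of the taken actions, weighted by advantages.
    taken_log_probs = log_probs[np.arange(len(actions)), actions]
    policy_loss = np.mean(-taken_log_probs * advantages)
    # 2) Entropy of the policy (to be maximized, hence subtracted below).
    entropy = np.mean(-np.sum(probs * log_probs, axis=-1))
    # 3) Value: mean squared error between returns and value estimates.
    value_loss = np.mean((returns - values) ** 2)
    total = policy_loss - entropy_c * entropy + value_c * value_loss
    return total, policy_loss, entropy, value_loss

# Hypothetical batch of two steps with two possible actions.
logits = np.zeros((2, 2))
values = np.array([0.0, 1.0])
actions = np.array([0, 1])
advantages = np.array([1.0, -1.0])
returns = np.array([1.0, 1.0])

total, policy_loss, entropy, value_loss = a2c_loss(
    logits, values, actions, advantages, returns)
```

The entropy enters with a minus sign because the optimizer minimizes the total, and we want the entropy to go up.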

In [0]:
import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko


Here is the full agent class:

In [0]:
class A2CAgent:
  def __init__(self, model, lr=7e-3, gamma=0.99, value_c=0.5, entropy_c=1e-4):
    # Coefficients are used for the loss terms.
    self.value_c = value_c
    self.entropy_c = entropy_c
    self.gamma = gamma

    self.model = model
    self.model.compile(
      optimizer=ko.RMSprop(learning_rate=lr),
      # Define separate losses for policy logits and value estimate.
      loss=[self._logits_loss, self._value_loss])

  def test(self, env, render=True):
    obs, done, ep_reward = env.reset(), False, 0 #Get the initial values from the environment
    while not done:
      action, _ = self.model.action_value(obs[None, :]) # returns the sampled action and the value estimate (unused here)
      obs, reward, done, _ = env.step(action)
      ep_reward += reward #get the reward of the action
      #if render: #I think this is for visualization purposes.
       # env.render()
    return ep_reward

  def _value_loss(self, returns, value):
    # Value loss is typically MSE between value estimates and returns.
    return self.value_c * kls.mean_squared_error(returns, value)

  def _logits_loss(self, actions_and_advantages, logits):
    # A trick to input actions and advantages through the same API.
    # Split the (batch, 2) tensor along the last axis into actions and advantages.
    actions, advantages = tf.split(actions_and_advantages, 2, axis=-1)

    # Sparse categorical CE loss obj that supports sample_weight arg on `call()`.
    # `from_logits` argument ensures transformation into normalized probabilities.
    weighted_sparse_ce = kls.SparseCategoricalCrossentropy(from_logits=True)

    # Policy loss is defined by policy gradients, weighted by advantages.
    # Note: we only calculate the loss on the actions we've actually taken.
    actions = tf.cast(actions, tf.int32) # cast to dtype tf.int32
    policy_loss = weighted_sparse_ce(actions, logits, sample_weight=advantages)

    # Entropy loss can be calculated as cross-entropy over itself.
    probs = tf.nn.softmax(logits)
    entropy_loss = kls.categorical_crossentropy(probs, probs)

    # We want to minimize policy and maximize entropy losses.
    # Here signs are flipped because the optimizer minimizes.
    return policy_loss - self.entropy_c * entropy_loss

  def _returns_advantages(self, rewards, dones, values, next_value):
    # `next_value` is the bootstrap value estimate of the future state (critic).
    # Start from zeros and append next_value as the last element: this is the returns array.
    returns = np.append(np.zeros_like(rewards), next_value, axis=-1)

    # Returns are calculated as discounted sum of future rewards.
    for t in reversed(range(rewards.shape[0])): 
      # Bellman-style recursion: reward plus the discounted return of the next step (zeroed when the episode ended)
      returns[t] = rewards[t] + self.gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1] # drop the trailing bootstrap value, keeping batch_sz returns

    # Advantages are equal to returns - baseline (value estimates in our case).
    advantages = returns - values

    return returns, advantages

  def train(self, env, batch_sz=64, updates=250):
    # Storage helpers for a single batch of data.
    actions = np.empty((batch_sz,), dtype=np.int32)
    rewards, dones, values = np.empty((3, batch_sz)) #Create 3 empty arrays of batch_size:
    observations = np.empty((batch_sz,) + env.observation_space.shape) #create an array of (batchsize, observation_space.shape)

    # Training loop: collect samples, send to optimizer, repeat updates times.
    ep_rewards = [0.0] #current episode rewards
    next_obs = env.reset() #get observation from the environment
    for update in range(updates): # updates is the number of batches collected (one gradient update each), not the number of episodes
      for step in range(batch_sz): # run batch_sz environment steps to fill the batch
        observations[step] = next_obs.copy() # store the current observation
        actions[step], values[step] = self.model.action_value(next_obs[None, :]) # ask the model for an action (policy head) and a value estimate (value head)
        next_obs, rewards[step], dones[step], _ = env.step(actions[step]) # apply the action and get the next state of the environment

        ep_rewards[-1] += rewards[step]
        if dones[step]:  # the episode ended
          ep_rewards.append(0.0) # start a new episode reward counter
          next_obs = env.reset() # start a new episode
          logging.info("Episode: %03d, Reward: %03d" % (
            len(ep_rewards) - 1, ep_rewards[-2]))

      _, next_value = self.model.action_value(next_obs[None, :]) # value estimate of the observation after the batch, used to bootstrap the returns

      returns, advs = self._returns_advantages(rewards, dones, values, next_value)
      # A trick to input actions and advantages through same API.
      acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)

      # Performs a full training step on the collected batch.
      # Note: no need to mess around with gradients, Keras API handles it.
      #observations is the training data, [...the other array...] is the target data.
      losses = self.model.train_on_batch(observations, [acts_and_advs, returns])

      logging.debug("[%d/%d] Losses: %s" % (update + 1, updates, losses))

    return ep_rewards



#Results

We are now all set to train our single-worker A2C agent on CartPole-v0! The training process should take a couple of minutes. After the training is complete, you should see an agent achieve the target 200 out of 200 score.

In [21]:
agent = A2CAgent(model)
rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env)) # 200 out of 200

Finished training, testing...
122 out of 200


#Experimenting with the agent class

Since the agent class is a gigantic, messy beast, I will trace how it works.

In [0]:
class A2CAgentExperiment:
  def __init__(self, model, lr=7e-3, gamma=0.99, value_c=0.5, entropy_c=1e-4):
    print("Enter __init__")
    # Coefficients are used for the loss terms.
    self.value_c = value_c
    self.entropy_c = entropy_c
    self.gamma = gamma

    self.model = model
    print("before __init__ compile")
    self.model.compile(
      optimizer=ko.RMSprop(learning_rate=lr),
      # Define separate losses for policy logits and value estimate.
      loss=[self._logits_loss, self._value_loss])

    print("end of __init__")

  def test(self, env, render=True):
    obs, done, ep_reward = env.reset(), False, 0 #Get the initial values from the environment
    while not done:
      action, _ = self.model.action_value(obs[None, :]) # returns the sampled action and the value estimate (unused here)
      obs, reward, done, _ = env.step(action)
      ep_reward += reward #get the reward of the action
      #if render: #I think this is for visualization purposes.
       # env.render()
    return ep_reward

  def _value_loss(self, returns1, value1):
    print("Value loss enter, Returns: ", kls.mean_squared_error(returns1, value1))
    # Value loss is typically MSE between value estimates and returns.
    return self.value_c * kls.mean_squared_error(returns1, value1)

  def _logits_loss(self, actions_and_advantages, logits):
    print("_logits_loss enter")
    # A trick to input actions and advantages through the same API.
    # Split the (batch, 2) tensor along the last axis into actions and advantages.
    actions, advantages = tf.split(actions_and_advantages, 2, axis=-1)

    # Sparse categorical CE loss obj that supports sample_weight arg on `call()`.
    # `from_logits` argument ensures transformation into normalized probabilities.
    weighted_sparse_ce = kls.SparseCategoricalCrossentropy(from_logits=True)

    # Policy loss is defined by policy gradients, weighted by advantages.
    # Note: we only calculate the loss on the actions we've actually taken.
    actions = tf.cast(actions, tf.int32) # cast to dtype tf.int32
    policy_loss = weighted_sparse_ce(actions, logits, sample_weight=advantages)

    # Entropy loss can be calculated as cross-entropy over itself.
    probs = tf.nn.softmax(logits)
    entropy_loss = kls.categorical_crossentropy(probs, probs)

    # We want to minimize policy and maximize entropy losses.
    # Here signs are flipped because the optimizer minimizes.
    return policy_loss - self.entropy_c * entropy_loss

  def _returns_advantages(self, rewards, dones, values, next_value):
    print("_returns_advantages enter")
    # `next_value` is the bootstrap value estimate of the future state (critic).
    # Start from zeros and append next_value as the last element: this is the returns array.
    returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
    # Returns are calculated as discounted sum of future rewards.
    #print("rewards.shape[0] ", rewards.shape[0])
    for t in reversed(range(rewards.shape[0])):
      # Bellman-style recursion: reward plus the discounted return of the next step (zeroed when the episode ended)
      returns[t] = rewards[t] + self.gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1] # drop the trailing bootstrap value, keeping only the 64 batch returns
    
    #print("Bellmans result is returns: ", returns, " the values are taken from the nnets, values: ", values)

    # Advantages are equal to returns - baseline (value estimates in our case).
    advantages = returns - values

    return returns, advantages

  def train(self, env, batch_sz=64, updates=250):
    print("train enter")
    # Storage helpers for a single batch of data.
    actions = np.empty((batch_sz,), dtype=np.int32)
    rewards, dones, values = np.empty((3, batch_sz)) #Create 3 empty arrays of batch_size:
    observations = np.empty((batch_sz,) + env.observation_space.shape) #create an array of (batchsize, observation_space.shape)
    print("create empty arrays for actions, rewards, dones, values, and observations")
    print(actions.shape, " ", rewards.shape, " ", dones.shape, " ", values.shape, " ", observations.shape)

    # Training loop: collect samples, send to optimizer, repeat updates times.
    ep_rewards = [0.0] #current episode rewards
    next_obs = env.reset() #get observation from the environment
    for update in range(updates): # updates is the number of batches collected (one gradient update each), not the number of episodes
      for step in range(batch_sz): # run batch_sz environment steps to fill the batch
        observations[step] = next_obs.copy() # store the current observation
        actions[step], values[step] = self.model.action_value(next_obs[None, :]) # ask the model for an action (policy head) and a value estimate (value head)
        next_obs, rewards[step], dones[step], _ = env.step(actions[step]) # apply the action and get the next state of the environment

        ep_rewards[-1] += rewards[step]
        if dones[step]:  # the episode ended
          ep_rewards.append(0.0) # start a new episode reward counter
          next_obs = env.reset() # start a new episode
          logging.info("Episode: %03d, Reward: %03d" % (
            len(ep_rewards) - 1, ep_rewards[-2]))

      # This next_value is needed for the Bellman-style recursion in _returns_advantages.
      _, next_value = self.model.action_value(next_obs[None, :]) # value estimate of the observation after the batch, used to bootstrap the returns

      returns, advs = self._returns_advantages(rewards, dones, values, next_value) # here we go from train() into _returns_advantages()
      # A trick to input actions and advantages through same API.
      acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)

      # Performs a full training step on the collected batch.
      # Note: no need to mess around with gradients, Keras API handles it.
      #observations is the training data, [...the other array...] is the target data.
      #train_on_batch uses the losses we defined in __init__: _logits_loss and _value_loss.
      #I think it runs twice, once per nnet
      print("train_on_batch")
      losses = self.model.train_on_batch(observations, [acts_and_advs, returns]) # I think this is where _logits_loss comes in
      logging.debug("[%d/%d] Losses: %s" % (update + 1, updates, losses))

    return ep_rewards



In [23]:
ExpAgent = A2CAgentExperiment(model)
rewards_history = ExpAgent.train(env)
print("Finished training, testing...")
print("%d out of 200" % ExpAgent.test(env)) # 200 out of 200

Enter __init__
before __init__ compile
_logits_loss enter
Value loss enter, Returns:  Tensor("loss_1/output_2_loss/Mean:0", shape=(None,), dtype=float32)
end of __init__
train enter
create empty arrays for actions, rewards, dones, values, and observations
(64,)   (64,)   (64,)   (64,)   (64, 4)
_returns_advantages enter
train_on_batch
_logits_loss enter
Value loss enter, Returns:  Tensor("loss/output_2_loss/Mean:0", shape=(64,), dtype=float32)
_logits_loss enter
Value loss enter, Returns:  Tensor("loss/output_2_loss/Mean:0", shape=(64,), dtype=float32)
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train_on_batch
_returns_advantages enter
train

#The agent's execution path

1. __init__

2. _logits_loss and _value_loss (triggered by compile)

3. The train loop starts
	<ul>
  <li>
    1. Sets up the environment and the arrays (actions, rewards, dones, values, observations), with shapes (64,), (64,), (64,), (64,) and (64, 4) respectively, where 64 is the batch size.
  </li>
  
	2. Every update
		1. There is a step loop in which 64 steps (batch_sz) of the game are played.
    
            *Here we call the model's action_value function to get an action and a value for every step in the environment. We call the neural net to tell us what to do, and then we call the environment to tell us what happened.

		2. After the batch is finished, the model's action_value function is called once more to get next_value, the value estimate of the next observation. We need it because the recursion uses gamma * returns[t+1] (the return of the following step), so the last step of the batch needs a value to bootstrap from. Without this step we could not compute the returns.

		3. The _returns_advantages function is called (this is where the Bellman-style recursion lives) with the values of the batch run.
			Note: The returns come from the Bellman-style recursion, and the values are what we got from the neural net. **I am starting to think that what we want is a neural net that approximates the Bellman value, so we can tell how good each situation is.** In this model the value head outputs one number per state, not one per action; choosing the action is the job of the policy head.
			Note2: The advantages (returns - values) measure how far the value estimates are from the computed returns. The value loss minimizes that difference, while the policy loss uses the advantages as weights.

		4. We build an array of the actions taken and the advantage of every action by concatenating the actions and advantages arrays (the result has shape (64, 2): two columns of 64 values).

		5. We use the Keras train_on_batch function and pass it our observations, acts_and_advs and returns. 
			This is where the loss functions we registered in __init__ (_logits_loss and _value_loss) are used.

			NOTE: train_on_batch is what actually updates our networks, based on the observations (a batch of shape (64, 4)) and the two targets.

      </ul>
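
The returns and advantages step from the walkthrough above can be followed by hand on a tiny batch. The sketch below copies the recursion from `_returns_advantages` into a standalone function; the three-step batch, the value estimates and `gamma=0.5` are made-up numbers chosen so the arithmetic is easy to verify.

```python
import numpy as np

def returns_advantages(rewards, dones, values, next_value, gamma=0.99):
    # One extra slot at the end holds the bootstrap value of the next state.
    returns = np.append(np.zeros_like(rewards), next_value)
    for t in reversed(range(rewards.shape[0])):
        # (1 - dones[t]) cuts the recursion when the episode ended at step t.
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1]               # drop the bootstrap slot
    return returns, returns - values     # advantages = returns - baseline

rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([0.0, 1.0, 0.0])        # the episode ended at step 1
values = np.array([0.5, 0.5, 0.5])       # hypothetical value estimates
next_value = 4.0                         # hypothetical bootstrap estimate

returns, advs = returns_advantages(rewards, dones, values, next_value, gamma=0.5)
# Step 2: 1 + 0.5 * 4.0 = 3.0 (bootstraps from next_value)
# Step 1: 1 + 0 = 1.0         (episode ended, nothing is carried over)
# Step 0: 1 + 0.5 * 1.0 = 1.5

# Pack actions and advantages side by side, as train() does.
actions = np.array([0, 1, 1])
acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
```

`acts_and_advs` has one row per step and two columns, which is why the real batch ends up with shape `(64, 2)`.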