# Stochastic Policy Implementation

## Plan the changes

I will try not to modify the env and will concentrate all the changes in the policy and in the algorithm. There are 2 major changes to make:
- Inside the policy (`s_mlp_policy.py`): change the policy representation to accommodate the new latent inputs/units. Then change the `get_action` method so that it also samples the latents and returns the latent that was sampled.
- In the algorithm (`npo_snn.py`): declare `latent_var`, and change `surr_obj` and `logli`. The first and last of these will also be needed in the policy class.

In the future, the distribution of the latent variables will also depend on the parameters and the observations. This will again require a change in `surr_obj`.



## Changes executed
In the algorithm file:
1. a) Declare the symbolic latent var:

In [None]:
latent_var = self.policy.latent_space.new_tensor_variable(
    'latent',
    extra_dims=1 + is_recurrent,
)

1. b) Define the `latent_space` property in the policy (note that it is not in the env, which is not supposed to know about latents):

In [None]:
@property
def latent_space(self):
    return Box(low=-np.inf, high=np.inf, shape=(1,))

So far so good (of course, it is not doing anything yet). We need to get the sampled latents to `optimize_policy`, where the surr_loss will use them. For this, let's try to put these latent variables in `samples_data`. Let's leave it empty for now, just to make sure the placeholder compiles correctly.
- `samples_data` comes from `process_samples(itr,path)`. I have to change that one anyway, to concatenate the latents sampled along every path into the shape needed for the tensor operations.
- Inside `process_samples`: copy exactly everything that is done for "actions". I should still modify `agent_infos` for the entropy calculation!! FOR LATER
- Make `path` come with `"latents"`: go to `obtain_samples(itr)`, which calls `rllab.sampler.parallel_sampler.sample_paths`. This in turn sets all the workers to have the same parameters with `singleton_pool.run_each(..set_pol_para...)` and then makes them do the rollouts with `singleton_pool.run_collect(_worker_collect_one_path,...)`. Let's look at the function that collects the paths:

In [None]:
def _worker_collect_one_path(G, max_path_length):
    path = rollout(G.env, G.policy, max_path_length)
    return path, len(path["rewards"])

The `G` comes from the `train()` function, where it gets `env, policy`. We don't need to look at `singleton_pool` in `sampler.stateful_pool`; just concentrate on getting `rollout` in `rllab.sampler.utils` right. BACKDOOR!! Put everything in agent_info: `a, agent_info = agent.get_action(o)`!!!
It does:

In [None]:
agent_infos.append(agent_info)  # at every step, append the new dict to the list; all dicts have the same keys
...
return dict(...
    agent_infos=tensor_utils.stack_tensor_dict_list(agent_infos),  # returns a single dict whose values stack all the steps
...)
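
The stacking step can be sketched without rllab. This is only an illustration under stated assumptions: `stack_dict_list` is a simplified stand-in for `tensor_utils.stack_tensor_dict_list` (it ignores nested dicts), and `ToyPolicy` is a hypothetical policy whose `get_action` already returns a `"latent"` entry in `agent_info`.

```python
import numpy as np


def stack_dict_list(dict_list):
    """Simplified stand-in: turn a list of per-step dicts into one dict of stacked arrays."""
    keys = dict_list[0].keys()
    return {k: np.array([d[k] for d in dict_list]) for k in keys}


class ToyPolicy(object):
    """Hypothetical policy that reports its sampled latent through agent_info."""
    latent_dim = 2

    def get_action(self, observation):
        latent = np.random.randn(self.latent_dim)  # latent sampled at this step
        mean = np.zeros(1)
        log_std = np.zeros(1)
        action = mean + np.exp(log_std) * np.random.randn(1)
        return action, dict(mean=mean, log_std=log_std, latent=latent)


policy = ToyPolicy()
agent_infos = []
for _ in range(5):  # a 5-step "rollout" with a dummy observation
    _, agent_info = policy.get_action(np.zeros(3))
    agent_infos.append(agent_info)

stacked = stack_dict_list(agent_infos)  # stacked["latent"] has one row per step
```

With this layout the latents travel with the path exactly like `"mean"` and `"log_std"`, so the sampler itself needs no change.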

So we only need to change `get_action` in the policy! Now let's revisit what we were saying above: keep `process_samples` untouched and simply unpack `"latent"` in the same place where we unpack `"mean"` and `"log_std"`. This means the following modifications, which differ from the ones proposed above:
- `init_opt`: I think it's better to still define the latent var separately, not together with the `old_dist_info_vars`. Then add it to the `input_list` IN THE ORDER in which it will later be fed in `optimize_policy`, i.e. following how the list `all_input_values` is constructed.
- `optimize_policy`: append the latents to the tuple `all_input_values`.
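
The ordering constraint between the two methods can be illustrated with a small sketch. Everything here is hypothetical: the names in `input_names`, the helper `build_all_input_values`, and the position chosen for the latent are assumptions for illustration, not the actual contents of `npo_snn.py`. Plain lists stand in for arrays.

```python
# Hypothetical placeholder names, in the order the symbolic inputs would be declared in init_opt.
input_names = ["obs", "action", "advantage", "latent", "old_mean", "old_log_std"]


def build_all_input_values(samples_data):
    """Build the value tuple in the same order as `input_names`."""
    agent_infos = samples_data["agent_infos"]
    values = tuple(samples_data[k] for k in ("observations", "actions", "advantages"))
    values += (agent_infos["latent"],)  # the latent goes in the slot where it was declared
    values += tuple(agent_infos[k] for k in ("mean", "log_std"))
    return values


# Toy batch with two time steps.
samples_data = {
    "observations": [[0.0, 1.0], [1.0, 2.0]],
    "actions": [[0.5], [-0.5]],
    "advantages": [1.0, -1.0],
    "agent_infos": {
        "mean": [[0.4], [-0.3]],
        "log_std": [[0.0], [0.0]],
        "latent": [[0.1, -0.2], [0.3, 0.0]],
    },
}

all_input_values = build_all_input_values(samples_data)
paired = dict(zip(input_names, all_input_values))  # each value lands on its placeholder
```

If the two orders disagree, the compiled function receives, say, the old means where it expects the latents, and the error may only show up as a shape mismatch or, worse, not at all.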

### Injecting noise
To append noise variables to the input I could do 2 things:
- Append them to the observations after sampling them in `get_action`, and then treat the whole thing as expanded observations. Downside: the observations get "polluted", and when optimizing, after recovering `latent` from `agent_infos`, we would have to construct new observations that include those noise variables.
- Keep the latents separate and change `_f_dist` in the policy so that it also takes `latent_var` as input. Then construct the NN to accommodate that, i.e. combine them symbolically. Let's do this last option.

Let's list the changes to make:
- Add to `get_action` a normally sampled latent of shape (2,).
- Change `dist_info_sym` to also take the latent as input.

In [None]:
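# Algorithm (A): the action distribution now also depends on the latents.
dist_info_vars = self.policy.dist_info_sym(obs_var, latent_var, action_var)

# Policy (P): also give the latent to where we compute the output of the NN.
def dist_info_sym(self, obs_var, latent_var, action_var):
    # L.get_output takes a single `inputs` argument, so combine the two symbolically.
    # Assumption: one input layer of width obs_dim (defined below); TT is theano.tensor.
    extended_obs_var = TT.concatenate([obs_var, latent_var], axis=obs_var.ndim - 1)
    mean_var, log_std_var = L.get_output([self._l_mean, self._l_log_std], extended_obs_var)
    return dict(mean=mean_var, log_std=log_std_var)

# Policy constructor: the input layer must be wide enough for observation plus latent.
obs_dim = env_spec.observation_space.flat_dim + latent_dim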
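
To see why the latent has to be stored at sampling time, here is a minimal NumPy sketch. It is an assumption-laden toy, not the code in `s_mlp_policy.py`: `ToyLatentPolicy`, its linear `forward`, and the seed are all hypothetical. The point is only that the mean seen during the rollout can be recomputed later from the observation plus the stored latent.

```python
import numpy as np


class ToyLatentPolicy(object):
    """Hypothetical linear Gaussian policy whose mean depends on [observation, latent]."""

    def __init__(self, obs_dim, action_dim, latent_dim=2, seed=0):
        self.latent_dim = latent_dim
        self.rng = np.random.RandomState(seed)
        # The input width is obs_dim + latent_dim, mirroring the widened input layer.
        self.W = 0.1 * self.rng.randn(obs_dim + latent_dim, action_dim)
        self.log_std = np.zeros(action_dim)

    def forward(self, obs, latent):
        """Concatenate along the last axis; works for a single step or a batch."""
        extended = np.concatenate([obs, latent], axis=-1)
        mean = extended.dot(self.W)
        return mean, np.zeros_like(mean) + self.log_std

    def get_action(self, observation):
        latent = self.rng.randn(self.latent_dim)  # normally sampled latent, shape (2,)
        mean, log_std = self.forward(observation, latent)
        action = mean + np.exp(log_std) * self.rng.randn(*mean.shape)
        return action, dict(mean=mean, log_std=log_std, latent=latent)


policy = ToyLatentPolicy(obs_dim=3, action_dim=1)
obs = np.array([0.5, -1.0, 2.0])
action, info = policy.get_action(obs)

# At optimization time: same observation + stored latent gives back the same mean.
recomputed_mean, _ = policy.forward(obs, info["latent"])

# The same forward pass also handles a batch of 4 steps.
batch_mean, _ = policy.forward(np.zeros((4, 3)), policy.rng.randn(4, 2))
```

Without the stored latent, the recomputed distribution would correspond to a different noise draw than the one that produced the action.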

Also check `old_dist_info_vars`!! I think it's just the one that was sampled?

## New loss
See if we can use the new latents in the surrogate loss

In [11]:
import numpy as np
a = np.arange(12).reshape((4,3))
print a
b = np.random.randn(4,1)
print b
c = np.concatenate ((a,b), axis=1)
print c

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[ 0.60336135]
 [-1.196441  ]
 [ 0.0301206 ]
 [ 0.34410733]]
[[  0.           1.           2.           0.60336135]
 [  3.           4.           5.          -1.196441  ]
 [  6.           7.           8.           0.0301206 ]
 [  9.          10.          11.           0.34410733]]


In [13]:
import theano.tensor as TT
a = TT.fmatrix('a')


In [16]:
a.shape

Shape.0

In [17]:
a = np.array((1.332))

In [27]:
a = -0.00437
bucket = np.floor(a/0.01)  # floor rounds toward -inf, so -0.437 becomes -1 (it does not truncate toward zero)
print bucket

-1.0
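
The difference between flooring and truncating only shows up for negative values, which is exactly the case above. A small standard-library check (the variable names are just for illustration):

```python
import math

width = 0.01

neg = -0.00437
floor_neg = math.floor(neg / width)  # rounds toward -inf
trunc_neg = math.trunc(neg / width)  # rounds toward zero

pos = 0.00437
floor_pos = math.floor(pos / width)
trunc_pos = math.trunc(pos / width)
```

Flooring is what bucketing needs here: with truncation, the values in (-0.01, 0.01) would all share bucket 0, making it twice as wide as the others.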


In [30]:
bound = 3
num_bins=600
step = (2.*bound)/num_bins
print step
samples=num_bins*10
x = np.arange(-bound,bound+step, step)
print len(x)

0.01
601
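
A hedged alternative for building the grid, assuming the goal is to histogram samples on [-bound, bound]: `np.arange` with a floating-point step can produce one element more or fewer than expected because of rounding at the end point, whereas `np.linspace` takes the number of points directly. The helper `bin_index` is hypothetical and clamps values outside the range into the first or last bin.

```python
import numpy as np

bound = 3
num_bins = 600
step = (2.0 * bound) / num_bins

# num_bins bins need num_bins + 1 edges; linspace includes both end points exactly.
edges = np.linspace(-bound, bound, num_bins + 1)


def bin_index(value):
    """Index of the bin containing `value`, clamped to [0, num_bins - 1]."""
    idx = int(np.floor((value + bound) / step))
    return min(max(idx, 0), num_bins - 1)


lowest = bin_index(-3.0)
highest = bin_index(3.0)
example = bin_index(-0.00437)
```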


In [31]:
range(3)

[0, 1, 2]

In [32]:
np.zeros(3)

array([ 0.,  0.,  0.])

In [None]:
from matplotlib import pyplot as plt
plt.close('all')