# Actor-Critic
There are different types of policy-based reinforcement learning:
![equations](images/RL_equations.png)
[PyTorchA2CPolicy](https://gitlab.tu-berlin.de/OKS/plato/blob/actorcritic/DialogueManagement/DialoguePolicy/ReinforcementLearning/pytorch_a2c_policy.py#L78) implements the generalized A2C algorithm (A2C = Advantage Actor-Critic).  
The neural network is trained with a loss that consists of three parts:
1. policy-loss (for actor)
2. entropy-regularization
3. value-loss (for critic)
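
Written as a formula, the three parts combine as follows. This is a sketch read off the `calc_loss` code in this section; the symbols $c_{\text{ent}}$ (for `entropy_coef`), $c_{\text{value}}$ (for `value_loss_coef`), $\hat{A}$ (advantage) and $R$ (the return target, `returnn`) are notation chosen here, not taken from the repository.

$$
L = \underbrace{-\,\mathbb{E}\big[\log \pi(a \mid s)\,\hat{A}\big]}_{\text{policy-loss}}
\;-\; c_{\text{ent}}\,\underbrace{\mathbb{E}\big[H(\pi(\cdot \mid s))\big]}_{\text{entropy}}
\;+\; c_{\text{value}}\,\underbrace{\mathbb{E}\big[(V(s) - R)^2\big]}_{\text{value-loss}}
$$

The entropy term enters with a negative sign, so minimizing $L$ rewards policies that stay spread out over several actions.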

It is trained in an __off-policy__ manner. This means that the losses are calculated from experience (dialogues) that was generated by an __outdated/old__ policy; this experience is __off-policy__, hopefully not too far __off__.
The __value-loss__ is the mean squared error between the __state-value__ estimate in step __k__ and the __overall value__ in step __k+1__. In the code this overall value is called __returnn__. It is itself only an estimate, calculated as state-value + advantage in step __k+1__.
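
To make the relation between the return target and the value-loss concrete, here is a minimal pure-Python sketch. It assumes the target is built as "value at rollout time + advantage", as described above; the function name `value_loss` and its arguments are hypothetical and do not come from the repository.

```python
def value_loss(new_values, old_values, advantages):
    """Mean squared error between fresh critic estimates and return targets."""
    # Assumption: the target ("returnn") is the rollout-time value plus the advantage.
    returns = [v + a for v, a in zip(old_values, advantages)]
    squared_errors = [(nv - r) ** 2 for nv, r in zip(new_values, returns)]
    return sum(squared_errors) / len(squared_errors), returns


loss, returns = value_loss(
    new_values=[1.0, 0.5], old_values=[0.5, 0.5], advantages=[1.0, -0.5]
)
print(returns, loss)
```

Because the target itself depends on the critic's earlier estimates, the critic is regressed towards a bootstrapped quantity rather than a ground-truth return.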

In [None]:
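def calc_loss(exps: Rollout, agent: AbstractA2CAgent, p: A2CParams):
    # Re-evaluate the stored observations with the current network.
    dist, value = agent.calc_distr_value(exps.env_steps.observation)
    # Entropy bonus: a higher entropy keeps the policy exploring.
    entropy = dist.entropy().mean()
    # Actor: raise the log-probability of actions with positive advantage.
    policy_loss = -(dist.log_prob(**exps.agent_steps.actions) * exps.advantages).mean()
    # Critic: mean squared error between value estimate and return target.
    value_loss = (value - exps.returnn).pow(2).mean()
    loss = policy_loss - p.entropy_coef * entropy + p.value_loss_coef * value_loss
    return loss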

# Neural Network Architecture
The neural network consists of:
1. encoder
2. actor-head
3. critic-head


In [None]:
class PolicyA2CAgent(AbstractA2CAgent):
    def __init__(
        self,
        vocab_size,
        num_intents,
        num_slots,
        encode_dim=64,
        embed_dim=32,
        padding_idx=None,
    ) -> None:
        super().__init__()
        self.encoder = StateEncoder(vocab_size, encode_dim, embed_dim, padding_idx)
        self.actor = Actor(encode_dim, num_intents, num_slots)
        self.critic = nn.Linear(encode_dim, 1)

    def forward(self, x):
        # Actor and critic share the pooled encoder features.
        features_pooled = self.encoder(x)
        intent_probs, slots_sigms = self.actor(features_pooled)
        value = self.critic(features_pooled)
        return (intent_probs, slots_sigms), value

    def calc_value(self, x):
        value = self.critic(self.encoder(x))
        return value

    def calc_distr_value(self, state):
        (intent_probs, slot_sigms), value = self.forward(state)
        distr = CommonDistribution(intent_probs, slot_sigms)
        return distr, value

    def calc_distr(self, state):
        distr, value = self.calc_distr_value(state)
        return distr

    def step(self, x) -> AgentStep:
        (intent_probs, slot_sigms), value = self.forward(x)
        distr = CommonDistribution(intent_probs, slot_sigms)
        intent, slots = distr.sample()
        v_values = value.data
        return AgentStep((intent.item(), slots.numpy()), v_values)


# Generalized Advantage Calculation
### Advantage
"how much better is a certain action towards the average". in other words: "What is the surplus of chosing this action in comparance to all other actions". The average value  _V_ of a state is an action indepentend estimation by the critic. 
![advantage](images/advantage.jpg)
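
As a worked example, the advantage of a single step can be estimated from the observed reward and the critic's values of the current and the next state. The following is a minimal sketch of this one-step estimate; the function name and the default discount are illustrative assumptions.

```python
def one_step_advantage(reward, value, next_value, discount=0.99, done=False):
    """Estimate A(s, a) as r + discount * V(s') - V(s)."""
    # A finished episode has no successor state to bootstrap from.
    bootstrap = 0.0 if done else discount * next_value
    return reward + bootstrap - value


# The action earned 1.0 and led to a state valued 3.0; the critic expected 2.0.
print(one_step_advantage(reward=1.0, value=2.0, next_value=3.0, discount=0.5))
```

A positive result means the action turned out better than the critic's average expectation for that state, a negative one means it turned out worse.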

### "Generalization"
![generalized](images/generalized_a2c_equation.jpg)
In practice this sum does not go to infinity but only some steps (e.g. 5) "into the future". The generalization is a discounted sum over future one-step advantages (TD errors). In the code these one-step terms are called __bellman_delta__.

For details see the [berkeley-lecture-notes](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_5_actor_critic_pdf#page=24).
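
The truncated sum can be written out as follows. This is a sketch that follows the index convention of the `generalized_advantage_estimation` code in this section (the reward of step $t$ is stored at index $t+1$); $T$ stands for `num_rollout_steps`, $\gamma$ for `discount` and $\lambda$ for `gae_lambda`, and the termination mask is left out for readability.

$$
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l}
$$

Equivalently, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$ with $\hat{A}_T = 0$, which is the backward recursion used in the code.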


In [None]:
def generalized_advantage_estimation(
    rewards, values, dones, num_rollout_steps, discount, gae_lambda
):
    assert values.shape[0] == 1 + num_rollout_steps
    advantage_buffer = torch.zeros(rewards.shape[0] - 1, rewards.shape[1])
    next_advantage = 0
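    # Walk backwards so each step can reuse the advantage of its successor.
    for i in reversed(range(num_rollout_steps)):
        # mask is 0 where the episode ended, which stops bootstrapping there.
        mask = torch.tensor((1 - dones[i + 1]), dtype=torch.float32)
        # One-step TD error (Bellman residual) of step i.
        bellman_delta = rewards[i + 1] + discount * values[i + 1] * mask - values[i]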
        advantage_buffer[i] = (
            bellman_delta + discount * gae_lambda * next_advantage * mask
        )
        next_advantage = advantage_buffer[i]
    return advantage_buffer
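
The recursion can be checked by hand on a tiny rollout. The following pure-Python sketch mirrors the function above for a single environment with plain floats instead of tensors; `gae_single_env` is a hypothetical helper written for this illustration and is not part of the repository.

```python
def gae_single_env(rewards, values, dones, discount, gae_lambda):
    """Backward GAE recursion for one environment, using plain floats."""
    num_steps = len(values) - 1  # values holds one extra bootstrap entry
    advantages = [0.0] * num_steps
    next_advantage = 0.0
    for i in reversed(range(num_steps)):
        mask = 0.0 if dones[i + 1] else 1.0
        delta = rewards[i + 1] + discount * values[i + 1] * mask - values[i]
        advantages[i] = delta + discount * gae_lambda * next_advantage * mask
        next_advantage = advantages[i]
    return advantages


advantages = gae_single_env(
    rewards=[0.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    discount=0.5,
    gae_lambda=0.5,
)
print(advantages)
```

With all values at zero, the last step's advantage is just its reward (1.0), and the first step adds the successor's advantage weighted by `discount * gae_lambda = 0.25`, giving 1.25.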


In the [process_dialogue_to_turns](https://gitlab.tu-berlin.de/OKS/plato/blob/actorcritic/DialogueManagement/DialoguePolicy/ReinforcementLearning/pytorch_a2c_policy.py#L148) method, the turns with acts that were not created by the policy have to be filtered out.
The __discounted_returns__ are currently not used. I originally wanted to use them in place of the rewards, but it also seems to work like this.
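
The helper that computes the discounted returns is not shown in this section. As a rough idea of what such a helper does, here is a minimal sketch assuming the standard definition (each return is the reward plus the discounted return of the following turn); the name `discounted_returns` is a stand-in and the actual implementation in the repository may differ.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every turn of one dialogue."""
    returns = []
    running = 0.0
    # Accumulate from the last turn backwards, then restore the original order.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

A reward that only arrives at the end of the dialogue is thereby propagated, in discounted form, to all earlier turns.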
     

In [None]:
 
    def process_dialogue_to_turns(self, dialogue: List[Dict]) -> List[DialogTurn]:
        assert dialogue[0]["action"][0].intent == "welcomemsg"
        assert dialogue[-1]["action"][0].intent == "bye"
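        # Move the final reward from the "bye" turn to the last remaining turn.
        dialogue[-2]["reward"] = dialogue[-1]["reward"]
        # Drop the fixed welcome and bye turns.
        dialogue = dialogue[1:-1]
        rewards = [t["reward"] for t in dialogue]
        # Discounted returns are computed but not yet passed to DialogTurn.
        returns = calc_discounted_returns(rewards, self.gamma)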
        turns = [
            DialogTurn(
                d["action"][0],
                tokenize(self.text_field, d["state"]),
                d["reward"],
                d["state"].value,
            )
            for d, ret in zip(dialogue, returns)
            if hasattr(d["state"], "value")
        ]
        return turns


# State-Encoding
1. Convert the state to JSON.
2. Tokenize it and map the tokens to a sequence of integers.
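
The two steps can be illustrated without any framework. The sketch below uses only the standard library; the regex tokenizer, the helper names `build_vocab` and `encode_state`, and the example state are assumptions for illustration, whereas the code in this section relies on a torchtext-style `text_field` with its own vocabulary.

```python
import json
import re


def build_vocab(texts):
    """Assign a running integer id to every distinct token."""
    vocab = {}
    for text in texts:
        for token in re.findall(r"\w+", text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_state(state: dict, vocab: dict):
    """Serialize the state to JSON, tokenize it, and look up known tokens."""
    state_json = json.dumps(state, sort_keys=True)
    tokens = re.findall(r"\w+", state_json)
    # Tokens missing from the vocabulary are dropped, as in tokenize().
    return [vocab[t] for t in tokens if t in vocab]


state = {"slots_filled": ["area"], "is_terminal": False}
vocab = build_vocab([json.dumps(state, sort_keys=True)])
ids = encode_state(state, vocab)
print(ids)
```

The resulting integer sequence is what an embedding layer in the encoder can consume.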

In [None]:
def state_to_json(state:SlotFillingDialogueState)->str:
    temp = deepcopy(state)
    del temp.context
    del temp.system_requestable_slot_entropies
    del temp.db_result
    del temp.dialogStateUuid
    del temp.user_goal
    del temp.slots
    del temp.item_in_focus
    temp.db_matches_ratio = int(round(temp.db_matches_ratio, 2) * 100)
    temp.slots_filled = [s for s,v in temp.slots_filled.items() if v is not None]
    if temp.last_sys_acts is not None:
        temp.last_sys_acts = action_to_string(temp.last_sys_acts, system=True)
        temp.user_acts = action_to_string(temp.user_acts, system=False)

    d = todict(temp)
    assert d is not None
    # d['item_in_focus'] = [(k,d['item_in_focus'] is not None and d['item_in_focus'].get(k,None) is not None) for k in self.domain.requestable_slots]
    s = json.dumps(d)
    # state_enc = int(hashlib.sha1(s.encode('utf-8')).hexdigest(), 32)
    return s

def tokenize(text_field, state: SlotFillingDialogueState):
    state_string = state_to_json(state)
    example = Example.fromlist([state_string], [("dialog_state", text_field)])
    tokens = [t for t in example.dialog_state if t in text_field.vocab.stoi]
    return tokens
