![임도형 comment](https://github.com/dhrim/keras_example_seminia_2020/raw/master/comment.png)

# Overview

- Source: https://keras.io/examples/rl/actor_critic_cartpole/
- Task: balancing a pole with reinforcement learning
- Data: N/A
- Model: DNN
- Method: Actor Critic Method


<br>

# Data

N/A


<br>

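# Model

A DNN that takes four inputs (cart position, cart velocity, pole angle, and the velocity of the pole tip)
and produces two outputs, one for pushing the cart left and one for pushing it right.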


<br>

# OpenAI GYM CartPole-v0

https://github.com/openai/gym/wiki/CartPole-v0

Observation
- Cart position
- Cart velocity
- Pole angle
- Pole tip velocity

Actions
- 0 : push the cart to the left
- 1 : push the cart to the right


<br>

# Reinforcement learning

Training repeats the following steps.
1. action = model(state)
1. state, reward = env(action)
1. update the model using the reward
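
The loop above can be sketched with a toy stand-in for the environment. Everything here (`ToyEnv`, `random_policy`, the reward rule) is hypothetical and chosen only to make the loop runnable without Gym. It is not the CartPole environment, and the update step is reduced to recording the reward.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the position must stay within [-3, 3]."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 moves left, action 1 moves right
        self.position += 1 if action == 1 else -1
        done = abs(self.position) > 3
        reward = 0.0 if done else 1.0  # reward for every step that stays in bounds
        return self.position, reward, done


def random_policy(state, rng):
    """Stand-in for model(state): ignores the state and picks an action at random."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=100):
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)             # 1. action = model(state)
        state, reward, done = env.step(action)  # 2. state, reward = env(action)
        rewards.append(reward)                  # 3. a real agent would learn from this reward
        if done:
            break
    return rewards


rng = random.Random(0)
episode_rewards = run_episode(ToyEnv(), random_policy, rng)
print("steps:", len(episode_rewards), "total reward:", sum(episode_rewards))
```

In the actual notebook, step 3 is deferred until the episode ends, when all recorded rewards are turned into a loss.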



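# Overall code flow

```
while not solved

  while the pole has not fallen and the step limit (max_steps_per_episode) is not reached  <--- one pass of this loop is one episode.

    actor, critic = model(state)
    action = sample from actor
    state, reward = env(action)

    store actor, critic, and reward in the histories


  compute the critic loss
  compute the actor loss

  total loss = critic loss + actor loss

  update the model with the total loss

```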

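It seems that the actor and the reward alone should be enough, so it is not clear to me why the critic is needed.

The reason is probably the core of the Actor Critic Method.

Understanding it likely requires working through the theory.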
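
One common explanation, offered here as a hedged sketch rather than as the original author's reasoning: the critic's estimate acts as a baseline that is subtracted from the observed return (the `diff = ret - value` line in the training code further down). This leaves the expected policy gradient unchanged while reducing its variance. The two-action setup below is hypothetical (rewards of 10 and 11, a uniform policy, and the expected reward as the baseline are made-up choices) and evaluates the gradient estimate exactly for each action.

```python
import numpy as np

# Hypothetical setup: two actions, a uniform softmax policy, fixed rewards.
probs = np.array([0.5, 0.5])       # pi(a) for a = 0, 1
rewards = np.array([10.0, 11.0])   # made-up reward for each action

# For a softmax policy, d/d(logit_0) log pi(a) = 1[a == 0] - pi(0).
grad_log_prob = np.array([1.0, 0.0]) - probs[0]


def gradient_stats(baseline):
    """Mean and variance of the one-sample estimate (R - baseline) * grad log pi(a)."""
    samples = (rewards - baseline) * grad_log_prob
    mean = np.sum(probs * samples)
    variance = np.sum(probs * (samples - mean) ** 2)
    return mean, variance


mean_plain, var_plain = gradient_stats(baseline=0.0)
# The expected reward plays the role of the critic's estimate here.
mean_base, var_base = gradient_stats(baseline=np.sum(probs * rewards))

print("no baseline   : mean", mean_plain, "variance", var_plain)
print("with baseline : mean", mean_base, "variance", var_base)
```

Both estimates have the same mean, but the baseline version has much lower variance, which is one reason a critic can make training more stable.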













# Tags

```
#reinforce_learning
#cart_pole
#dnn
```

# Actor Critic Method

**Author:** [Apoorv Nandan](https://twitter.com/NandanApoorv)<br>
**Date created:** 2020/05/13<br>
**Last modified:** 2020/05/13<br>
**Description:** Implement Actor Critic Method in CartPole environment.

## Introduction

This script shows an implementation of the Actor Critic method on the CartPole-V0 environment.

### Actor Critic Method

As an agent takes actions and moves through an environment, it learns to map
the observed state of the environment to two possible outputs:

1. Recommended action: A probability value for each action in the action space.
   The part of the agent responsible for this output is called the **actor**.
2. Estimated rewards in the future: Sum of all rewards it expects to receive in the
   future. The part of the agent responsible for this output is the **critic**.

The actor and the critic learn to perform their tasks such that the recommended actions
from the actor maximize the rewards.

### CartPole-V0

A pole is attached to a cart placed on a frictionless track. The agent has to apply
force to move the cart. It is rewarded for every time step the pole
remains upright. The agent, therefore, must learn to keep the pole from falling over.

### References

- [CartPole](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)
- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)


## Setup


In [None]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for future rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Create the environment
env.seed(seed)  # For reproducibility.
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0


## Implement Actor Critic network

This network learns two functions:

1. Actor: This takes as input the state of our environment and returns a
probability value for each action in its action space.
2. Critic: This takes as input the state of our environment and returns
an estimate of total rewards in the future.

In our implementation, they share the initial layer.
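
To make the shared-layer idea concrete, here is a rough NumPy sketch of one forward pass through such a network. The weights are random placeholders, biases are omitted, and the names (`w_common`, `w_actor`, `w_critic`, `forward`) are assumptions for illustration only. The model actually used in this notebook is the Keras model defined below.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inputs, num_hidden, num_actions = 4, 128, 2

# Random placeholder weights; a trained network would have learned values.
w_common = rng.normal(scale=0.1, size=(num_inputs, num_hidden))
w_actor = rng.normal(scale=0.1, size=(num_hidden, num_actions))
w_critic = rng.normal(scale=0.1, size=(num_hidden, 1))


def forward(state):
    """Return (action probabilities, value estimate) for a batch of states."""
    hidden = np.maximum(state @ w_common, 0.0)            # shared ReLU layer
    logits = hidden @ w_actor
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    action_probs = exp / exp.sum(axis=1, keepdims=True)   # actor head: softmax
    value = hidden @ w_critic                             # critic head: linear
    return action_probs, value


state = np.array([[0.01, -0.02, 0.03, 0.04]])  # one made-up observation
action_probs, value = forward(state)
print(action_probs.shape, value.shape)
```

Because both heads read the same `hidden` activations, a gradient from either loss updates the shared layer.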


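![임도형 comment](https://github.com/dhrim/keras_example_seminia_2020/raw/master/comment.png)

A DNN that takes the 4 observations as input and produces 2 outputs: actor_probability and critic_value.


actor_probability looks like [0.1, 0.9]: the probabilities of choosing action 0 (push left) and action 1 (push right).

critic_value is the estimated future reward.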
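
As a small illustration of how such a probability vector is used, the sketch below samples actions from a hypothetical actor output of `[0.1, 0.9]` with `np.random.choice`, the same call the training loop uses. The probability values and the sample count are made up.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

action_probs = np.array([[0.1, 0.9]])  # hypothetical actor output, shape (1, 2)
num_actions = action_probs.shape[1]

# Draw many actions; each draw is what the agent would do at one timestep.
samples = [
    np.random.choice(num_actions, p=np.squeeze(action_probs))
    for _ in range(1000)
]
fraction_right = sum(samples) / len(samples)
print("fraction of action 1:", fraction_right)
```

Sampling, rather than always taking the most probable action, keeps some exploration in the agent's behavior.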




In [None]:
num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])


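![임도형 comment](https://github.com/dhrim/keras_example_seminia_2020/raw/master/comment.png)


**Terminology**
- episode : one full run of the environment from start to finish, with no learning during the run. It ends when the pole falls or the maximum number of steps is reached.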

## Train


In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards further in the future are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} falls after {:.0f} steps at episode {}"
        print(template.format(running_reward, episode_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break

running reward: 14.47 falls after 25 steps at episode 10
running reward: 35.84 falls after 152 steps at episode 20
running reward: 76.86 falls after 145 steps at episode 30
running reward: 102.82 falls after 164 steps at episode 40
running reward: 111.63 falls after 42 steps at episode 50
running reward: 128.15 falls after 127 steps at episode 60
running reward: 141.36 falls after 200 steps at episode 70
running reward: 164.89 falls after 200 steps at episode 80
running reward: 172.59 falls after 200 steps at episode 90
running reward: 179.35 falls after 200 steps at episode 100
running reward: 183.70 falls after 200 steps at episode 110
running reward: 183.97 falls after 193 steps at episode 120
running reward: 190.20 falls after 200 steps at episode 130
running reward: 193.51 falls after 200 steps at episode 140
Solved at episode 146!
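
The `running_reward` column is an exponential moving average of episode rewards, so it lags behind the latest episode. As a rough illustration, assume hypothetically that every episode scores 200 (the highest value in the log above) from the very first one. The sketch below counts how many episodes the average then needs to pass the 195 threshold.

```python
running_reward = 0.0
episodes = 0

# Hypothetical best case: every episode reaches 200 steps.
while running_reward <= 195:
    episode_reward = 200
    running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward
    episodes += 1

print("episodes needed:", episodes)
```

Even a perfect agent needs dozens of episodes before the average crosses the threshold, which is why the log continues after individual episodes already reach 200 steps.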


![임도형 comment](https://github.com/dhrim/keras_example_seminia_2020/raw/master/comment.png)

The English comments were removed and replaced with comments describing my own understanding.

A few variable names were changed slightly.

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0


while True:  # Run until solved
    state = env.reset()
    # state = [-0.01258566 -0.00156614  0.04207708 -0.00180545]. The 4 values of the current state.
    episode_reward = 0
    with tf.GradientTape() as tape:

        # Run one episode from start to finish.
        # While running, store the history of critic_value, action_probs, and reward.
        for timestep in range(1, max_steps_per_episode):

            state = tf.convert_to_tensor(state)
            # state = [-0.01303166 -0.04270197 -0.04540492  0.04584364]
            state = tf.expand_dims(state, 0)
            # state = [[-0.01303166 -0.04270197 -0.04540492  0.04584364]]

            action_probs, critic_value = model(state)
            # action_probs = [0.38199136 0.6180086 ]
            # critic_value = [1.2710695]

            critic_value_history.append(critic_value[0, 0])

            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            # action = 0
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            state, reward, done, _ = env.step(action)
            # state = [ 0.04863056  0.17443249 -0.03863073 -0.31581318]
            # reward = 1. Always 1.
            # done = False
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Exponential moving average: the new episode's reward is added with weight 0.05,
        # and the accumulated running reward is scaled down by 0.95.
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward


        # The values in returns are used later as the labels (targets) for the critic.
        # Each step's return is its own reward plus the discounted sum of the rewards that follow.
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)
        # returns = [8.64827525163591, 7.72553055720799, 6.793465209301, 5.8519850599, 4.90099501, 3.9403989999999998, 2.9701, 1.99, 1.0]

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()
        # returns = [1.5310393742552764, 1.1572248237373701, 0.7796343686687778, 0.3982298685994925, 0.012972797822436946, -0.3761757585180234, -0.769255108356872, -1.1663049566789414, -1.5673654095295166]

        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for action_prob, critic_value, ret in history:
            # actor loss (action_prob holds the log-probability stored in action_probs_history)
            diff = ret - critic_value
            actor_losses.append(-action_prob * diff)

            # critic loss
            critic_losses.append(
                huber_loss(tf.expand_dims(critic_value, 0), tf.expand_dims(ret, 0))
            )

        # Compute the total loss
        loss_value = sum(actor_losses) + sum(critic_losses)
        # Compute the gradients of the loss
        grads = tape.gradient(loss_value, model.trainable_variables)
        # Update the model with the computed gradients
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Reset the histories
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # One episode has finished. Increment the count.
    episode_count += 1

    if episode_count % 10 == 0:
        template = "running reward: {:.2f} falls after {:.0f} steps at episode {}"
        print(template.format(running_reward, episode_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
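
The return values shown in the comments above can be reproduced in isolation. The sketch below assumes a hypothetical 9-step episode with a reward of 1.0 at every step, which is consistent with the 9 values listed, and applies the same discounting and normalization as the training loop.

```python
import numpy as np

gamma = 0.99
eps = np.finfo(np.float32).eps.item()
rewards_history = [1.0] * 9  # assumed: a 9-step episode, reward 1.0 per step

# Walk backwards so each step's return includes all discounted later rewards.
returns = []
discounted_sum = 0
for r in rewards_history[::-1]:
    discounted_sum = r + gamma * discounted_sum
    returns.insert(0, discounted_sum)

# Normalize to zero mean and (approximately) unit standard deviation.
normalized = (np.array(returns) - np.mean(returns)) / (np.std(returns) + eps)

print(returns)
print(normalized.tolist())
```

Early steps end up with positive normalized returns and late steps with negative ones, because in a short episode the steps near the fall have little reward left to collect.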


## Visualizations
In early stages of training:
![Imgur](https://i.imgur.com/5gCs5kH.gif)

In later stages of training:
![Imgur](https://i.imgur.com/5ziiZUD.gif)
