# Continuous Control with Deep Reinforcement Learning: Deep Deterministic Policy Gradient

[paper link](https://arxiv.org/pdf/1509.02971.pdf)

## 1) How do we deal with actions in continuous space?

### Discrete action space


In [42]:
%%html
<img src="img/cartpole.gif" width="200" height="200"><img src="img/cartpole_actions.png" width="200" height="200">

### Continuous action space

In [43]:
%%html
<img src="img/pendulum.gif" width="200" height="200"><img src="img/pendulum_actions.png" width="200" height="200">

## 2) What is a Replay Buffer and why do we need it?
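### Learning algorithms assume samples are independent and identically distributed (i.i.d.)!

Consecutive transitions from the same episode are strongly correlated, so training on them in order breaks that assumption. A replay buffer decorrelates them: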

* Create a fixed-size queue that holds the most recent samples.
* Push every (state, action, reward, next state) tuple to the queue.
* When learning, sample uniformly at random from the queue to build each mini-batch.
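
The three bullets above can be sketched with the standard library alone. The `ReplayBuffer` class below is a hypothetical stand-in, not the `Memory` class imported from `replay_buffer` later in this notebook (its source is not shown here), and the capacity and batch size are illustrative values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size replay buffer (illustration only)."""

    def __init__(self, capacity, batch_size):
        # A deque with maxlen silently drops the oldest item when full
        self.buffer = deque(maxlen=capacity)
        self.batch_size = batch_size

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks temporal correlation
        n = min(self.batch_size, len(self.buffer))
        batch = random.sample(list(self.buffer), n)
        # Transpose a list of tuples into a tuple of per-field columns
        return tuple(zip(*batch)) if batch else ((), (), (), (), ())

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=5, batch_size=3)
for step in range(8):
    buffer.push(step, 0.0, -1.0, step + 1, False)

states, actions, rewards, next_states, dones = buffer.sample()
```

Because the capacity is 5, only the last five pushed transitions survive, and each call to `sample()` returns three distinct ones grouped by field.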


In [44]:
%%html 
<img src="img/memory.png" width="1" height="1">

<img src="img/lander.gif" width="400" height="400">

## 3) What is an off-policy algorithm?
Taken from https://stats.stackexchange.com/questions/184657/what-is-the-difference-between-off-policy-and-on-policy-learning


* In off-policy learning, the Q(s,a) function is learned from actions chosen by a policy other than the one being improved (for example, random actions). We don't even need a learned policy to collect the data!
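
A small tabular example may make this concrete. The sketch below is a toy built on stated assumptions, not part of DDPG: a hypothetical four-state chain in which the behaviour policy is uniformly random, yet Q-learning still recovers the greedy "always move right" policy from that data.

```python
import random

random.seed(0)

N_STATES = 4          # states 0..3; state 3 is terminal (toy environment)
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Deterministic chain: reward 1 only when the terminal state is reached."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):
    state, done = 0, False
    while not done:
        # Behaviour policy: purely random, it never consults q
        action = random.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        # Off-policy target: bootstrap from the greedy action in next_state
        target = reward if done else reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state

# Greedy policy read off the learned table for the non-terminal states
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

The data come from a policy that never looks at `q`, while the update target always uses the greedy action; that mismatch is what makes the method off-policy.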

## 4) How do we explore the action space?

* Add some random noise to our actions!
* Use an Ornstein–Uhlenbeck process so that the noise is correlated from one step to the next.
* Decay the noise scale over time.
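
A minimal sketch of these ideas, assuming a discretised Ornstein–Uhlenbeck update with the θ = 0.15 and σ = 0.2 values used in the DDPG paper. `OUProcess` and `noisy_action` are hypothetical names, not the `OUNoise` class imported later in this notebook, and the geometric decay schedule is an illustrative assumption.

```python
import math
import random


class OUProcess:
    """Hypothetical Ornstein-Uhlenbeck noise source (illustration only)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, x0=0.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = x0

    def sample(self):
        # Euler-Maruyama step: drift pulls x toward mu, diffusion adds noise
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * random.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def noisy_action(action, noise, t, low=-2.0, high=2.0, decay=0.999):
    """Add noise whose scale decays geometrically with the step count t."""
    scaled = action + (decay ** t) * noise.sample()
    # Clip to the action bounds (Pendulum torque lies in [-2, 2])
    return min(max(scaled, low), high)


random.seed(0)
noise = OUProcess()
actions = [noisy_action(0.5, noise, t) for t in range(100)]

# With sigma = 0 the process is deterministic and relaxes toward mu
relaxing = OUProcess(sigma=0.0, x0=1.0)
first_step = relaxing.sample()
```

Unlike independent Gaussian noise, each sample starts from the previous one, so the perturbation drifts smoothly instead of jittering around zero.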


In [45]:
%%html 
<img src="img/ou_noise.png" width="400" height="400">

## 5) What is actor-critic?

* The actor takes in a state and outputs an action.
* The critic takes in a state and an action and outputs a Q-value.
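
The two roles can be shown with a pair of plain functions. This is a sketch under loud assumptions: single linear layers with arbitrary, untrained weights stand in for the neural networks, and `actor` and `critic` here are hypothetical helpers, not the `Actor` and `Critic` classes used later.

```python
import math

MAX_TORQUE = 2.0  # Pendulum torque limit; the weights below are arbitrary


def actor(state, weights):
    """Deterministic policy: state -> bounded action."""
    pre_activation = sum(w * s for w, s in zip(weights, state))
    # tanh squashes to (-1, 1), then scale to the action range
    return MAX_TORQUE * math.tanh(pre_activation)


def critic(state, action, weights):
    """Action-value estimate: (state, action) -> scalar Q-value."""
    features = list(state) + [action]   # concatenate state and action
    return sum(w * f for w, f in zip(weights, features))


state = [1.0, 0.0, 0.5]                  # cos(theta), sin(theta), theta_dot
actor_weights = [0.3, -0.2, 0.1]         # hypothetical, untrained
critic_weights = [0.5, 0.5, -0.1, 0.2]   # one weight per state dim plus action

action = actor(state, actor_weights)
q_value = critic(state, action, critic_weights)
```

Note that the critic's input is the state concatenated with the action, which is why the critic networks later in this notebook are built with `state_size + action_size` inputs.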

In [46]:
%%html 
<img src="img/ac.jpg" width="400" height="400">

## 6) Algorithm Step by Step

In [55]:
%%html 
<img src="img/ddpg_algo.png" width="400" height="400">
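
Two lines of the algorithm are pure arithmetic and can be checked by hand: the critic target y = r + γ Q'(s', μ'(s')) and the soft target update θ' ← τθ + (1 − τ)θ'. The sketch below uses plain Python lists and made-up numbers; `td_targets` and `soft_update` are hypothetical helper names introduced only for illustration.

```python
def td_targets(rewards, next_q_values, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each sample i."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def soft_update(target_weights, online_weights, tau):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


# Made-up mini-batch of two transitions
targets = td_targets(rewards=[-1.0, -0.5], next_q_values=[-10.0, -4.0], gamma=0.95)

# With tau = 0.01 the target moves 1% of the way toward the online weights
target = soft_update(target_weights=[0.0, 1.0], online_weights=[1.0, 1.0], tau=0.01)
```

A small τ keeps the target networks changing slowly, so the regression target for the critic does not shift abruptly between updates.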

In [48]:
# Local modules in this directory that build the DDPG components
from networks import Actor, Critic
from replay_buffer import Memory
from ou_noise import OUNoise

# Third-party libraries
import gym
import numpy as np

In [57]:
import gym
import numpy as np

from networks import Actor, Critic
from ou_noise import OUNoise
from replay_buffer import Memory

class Game(object):
    def __init__(self, state_size, action_size,
                 dense_units, gamma, tau, episodes):

        self.state_size = state_size
        self.action_size = action_size
        self.dense_units = dense_units
        self.gamma = gamma
        self.tau = tau
        self.episodes = episodes

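        # 1. Create the actor and critic networks
        self.create_networks()
        # 2. Copy the weights into the target networks
        self.copy_weights()
        # 3. Create the replay buffer (capacity 10000, mini-batch size 32)
        self.memory = Memory(10000, 32)
        self.env = gym.make('Pendulum-v0')
        self.noise = OUNoise()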

    def create_networks(self):
        self.actor = Actor(self.state_size,
                           self.dense_units,
                           self.action_size)
        
        self.actor_target = Actor(self.state_size,
                                  self.dense_units,
                                  self.action_size)

        self.critic = Critic(self.state_size + self.action_size,
                             self.dense_units, self.action_size)
        
        self.critic_target = Critic(self.state_size + self.action_size,
                                    self.dense_units, self.action_size)

    def copy_weights(self):
        self.actor_target.model.set_weights(self.actor.model.get_weights())
        self.critic_target.model.set_weights(self.critic.model.get_weights())

    def _update_critic(self, state, action, reward, next_state):
        # Update the critic model toward the one-step TD target
        state = np.array(state)
        action = np.array(action)
        next_state = np.array(next_state)
        # Make the reward a column vector so it adds element-wise to the
        # (batch, 1) critic output instead of broadcasting to (batch, batch)
        reward = np.array(reward, dtype=np.float32).reshape(-1, 1)

        # 12. Next action from the target actor, mu'(s_{i+1})
        next_action = self.actor_target.model.predict(next_state)

        # Concatenate [next_state, next_action] to feed into the target critic
        next_state_action = np.concatenate((next_state, next_action), axis=1)

        # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
        next_q = self.critic_target.model.predict(next_state_action)
        y = reward + self.gamma * next_q

        # 13. Regress Q(s_i, a_i) onto y_i
        state_action = np.concatenate((state, action), axis=1)
        self.critic.model.fit(state_action, y, verbose=0)
        
    def _update_actor(self, state):
        # Update the actor model
        state = np.array(state)
        new_action = self.actor.model.predict(state)

        # Concatenate [state, mu(state)] to feed into the critic
        state_new_action = np.concatenate((state, new_action), axis=1)
        pred = self.critic.model.predict(state_new_action)

        # NOTE: fitting the actor output to the critic's Q-value is a stand-in,
        # not the sampled policy gradient of step 14, which ascends
        # grad_a Q(s, a) at a = mu(s) times grad_theta mu(s).
        self.actor.model.fit(state, pred, verbose=0)
    
    def _update_target_networks(self):

        # Update target_critic network
        critic_weights = self.critic.model.get_weights()
        critic_target_weights = self.critic_target.model.get_weights()
        for i in range(len(critic_target_weights)):
            critic_target_weights[i] = critic_weights[i] * self.tau + critic_target_weights[i] * (1.0 - self.tau)
        self.critic_target.model.set_weights(critic_target_weights)

        
        # Update target_actor network
        actor_weights = self.actor.model.get_weights()
        actor_target_weights = self.actor_target.model.get_weights()
        for i in range(len(actor_target_weights)):
            actor_target_weights[i] = actor_weights[i] * self.tau + actor_target_weights[i] * (1.0 - self.tau)
        self.actor_target.model.set_weights(actor_target_weights)



    def replay(self, mini_batch):
        state, action, reward, next_state, _ = mini_batch
        assert len(action[0]) == 1
        assert len(state[0]) == 3

        self._update_critic(state, action, reward, next_state)
        self._update_actor(state)
        self._update_target_networks()

# DDPG algorithm steps
# 1. Create Actor and Critic Networks x
# 2. Copy weights from Actor -> Actor_target and Critic -> Critic_target x
# 3. Create replay buffer x
# 4. Create a loop for M episodes x
# 5. Init a random process N for action exploration x
# 6. Get init observation for state s_1 x
# 7. Loop through a trajectory T x
# 8. Select action based on current actor policy x
# 9. Take action from #8 and get observed reward and new state x
# 10. Store transition in replay buffer x
# 11. Sample random minibatch of transitions x
# 12. Set y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) x
# 13. Update the critic using the calculated y_i x
# 14. Update the actor policy using the sampled policy gradient

    def run(self):
        try:
            # 4.
            for index_episode in range(self.episodes):
                # 6.
                state = self.env.reset()
                state = np.reshape(state, self.state_size)
                done = False
                t = 0
                sum_reward = 0
                # 7.
                while not done:
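                    # 8. Select an action from the current actor policy
                    action = self.actor.act(state)
                    if (index_episode + 1) % 20 == 0:
                        # Every 20th episode: render and act without noise
                        self.env.render()
                    else:
                        # Otherwise add exploration noise to the action
                        action = self.noise.get_action(action, t)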
                    # 9.
                    next_state, reward, done, _ = self.env.step(action)
                    next_state = np.reshape(next_state, self.state_size)
                    # 10.
                    self.memory.push(state, action, reward, next_state, done)

                    state = next_state
                    t += 1
                    sum_reward += reward
                    # 11.
                    transition_samples = self.memory.sample()
                    if len(transition_samples[0]) == 32:
                        self.replay(transition_samples)
                
                if (index_episode + 1) %  20 == 0:
                    print("Episode {}# , Reward{}".format(index_episode, sum_reward))
                else:
                    print("{}".format(index_episode))
        finally:
            pass


In [58]:
MEM_CAP = 10000
NUM_LAYERS = 1
DENSE_UNITS = 128
STATE_SIZE = 3
ACTION_SIZE = 1
LEARNING_RATE = 0.001
GAMMA = 0.95
EPSILON = 0.05
MINIBATCH_SIZE = 32
EPISODES = 10000
TAU = 1e-2

if __name__ == "__main__":
    game = Game(STATE_SIZE, ACTION_SIZE, DENSE_UNITS, GAMMA, TAU, EPISODES)
    game.run()

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Episode 19# , Reward-1389.9717272501193
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
Episode 39# , Reward-1366.31328757789
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
Episode 59# , Reward-1192.6532635552762
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
Episode 79# , Reward-1728.7851085314599
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
Episode 99# , Reward-1720.8595933815798
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
Episode 119# , Reward-1233.9365831064276
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
Episode 139# , Reward-1540.5223081578658
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
Episode 159# , Reward-1367.5476245852076
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
Episode 179# , Reward-1371.9611746647322
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194

KeyboardInterrupt: 
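
Step 14 of the algorithm, the sampled policy gradient, can be illustrated without a deep-learning framework. The toy below is an assumption, not this notebook's networks: a one-parameter linear actor a = w·s and a known, hand-written critic Q(s, a) = −(a − 2s)², so both gradients are available in closed form. Gradient ascent on Q through the actor should drive w toward 2.

```python
import random

random.seed(0)


def dq_da(state, action):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (action - 2.0 * state)


w = 0.0                 # actor parameter: mu(s) = w * s
learning_rate = 0.1

for _ in range(500):
    # Sample a mini-batch of scalar states (stand-in for replay samples)
    batch = [random.uniform(-1.0, 1.0) for _ in range(32)]
    # Chain rule: dQ/dw = dQ/da evaluated at a = mu(s), times dmu/dw = s
    grads = [dq_da(s, w * s) * s for s in batch]
    # Ascend the mean gradient, because the actor maximises Q
    w += learning_rate * sum(grads) / len(grads)
```

The actor is never told the best action directly; it only follows the critic's gradient with respect to the action, which is the key difference from regressing the actor output onto Q-values.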

In [59]:
%%html 
<img src="img/envs.png" width="400" height="400">

In [60]:
%%html 
<img src="img/pendulum_fin.gif" width="400" height="400">