# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [3]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_OneAgent/Reacher_Linux_NoVis/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/aras/unity-envs/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [4]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [5]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
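
The action constraint described above (four entries per agent, each between `-1` and `1`) can be illustrated without the Unity environment. The snippet below is a minimal sketch, assuming the sizes printed by the previous cell; the seeded generator `rng` is only there to make the sketch repeatable and is not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumption: seeded only so the sketch is repeatable

num_agents, action_size = 1, 4   # values printed by the cell above

# Draw from a standard normal, then clip every entry into [-1, 1].
raw = rng.standard_normal((num_agents, action_size))
actions = np.clip(raw, -1.0, 1.0)

print(actions.shape)
print(bool(np.all(np.abs(actions) <= 1.0)))
```

Because about 32% of standard-normal draws fall outside `[-1, 1]`, clipping places a noticeable share of the random actions exactly on the bounds.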


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects a random action at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [6]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0


When finished, you can close the environment.

In [7]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training your agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [8]:
# Testing the train mode
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
#scores = np.zeros(num_agents)                          # initialize the score (for each agent)
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    #print(action)
    action = np.clip(action, -1, 1)                  # all actions between -1 and 1
    #print(action)
    env_info = env.step(action)[brain_name]           # send all actions to the environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    done = env_info.local_done[0]                        # see if episode finished
    #scores += env_info.rewards                         # update the score (for each agent)
    state = next_state                               # roll over states to next time step
    if done:                                          # exit loop if episode finished
        #print(action.shape, reward)
        #print(done)
        break
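# NOTE: `scores` is not updated in this cell; the value printed is left over from the previous cell
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))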
num_steps

Total score (averaged over agents) this episode: 0.0


1001

## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
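
The solve criterion above can be tracked with a fixed-length window of episode scores. The following is a minimal sketch; the names `is_solved` and `scores_window` are hypothetical helpers, not part of the project code. Unlike a plain running mean, it only reports success once the window actually holds 100 episodes.

```python
from collections import deque

def is_solved(window, target=30.0, window_size=100):
    """Return True once `window` holds `window_size` scores whose mean is >= target."""
    return len(window) == window_size and sum(window) / window_size >= target

scores_window = deque(maxlen=100)   # keeps only the 100 most recent episode scores
for episode_score in [31.0] * 99:
    scores_window.append(episode_score)
print(is_solved(scores_window))     # False: only 99 episodes so far

scores_window.append(31.0)
print(is_solved(scores_window))     # True: 100 episodes averaging 31.0
```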

In [9]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))


TensorFlow Version: 1.7.1
Default GPU Device: 


In [10]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    rates = tf.placeholder(tf.float32, [None], name='rates')
    return states, actions, targetQs, rates

In [11]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
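
The activation used in the hidden layers above, `tf.maximum(alpha * x, x)`, is a leaky ReLU whenever `0 < alpha < 1`: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A small NumPy sketch of the same expression (the helper name `leaky_relu` is an assumption for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU written the same way as in the cell above: max(alpha * x, x)."""
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))   # negative entries are scaled by alpha, the rest are unchanged
```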

In [12]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
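        # First fully connected layer
        # NOTE: `action_size` is not a parameter of this function; it is read from the notebook's global scope
        h1 = tf.layers.dense(inputs=states, units=action_size)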
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [13]:
def model_loss(action_size, hidden_size, states, actions, targetQs, rates):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    neg_log_prob = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, 
                                                           labels=actions)
    targetQs = tf.reshape(targetQs, shape=[-1, 1])
    gloss = tf.reduce_mean(neg_log_prob * targetQs) # DPG: r+(gamma*nextQ)
    gQs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    dQs = discriminator(actions=actions, hidden_size=hidden_size, states=states, reuse=True) # Qs
    rates = tf.reshape(rates, shape=[-1, 1])
    dlossA = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=gQs, # GAN
                                                                    labels=rates)) # 0-1
    dlossA += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=dQs, # GAN
                                                                     labels=rates)) # 0-1
    dlossA /= 2
    dlossQ = tf.reduce_mean(tf.square(gQs - targetQs)) # DQN
    dlossQ += tf.reduce_mean(tf.square(dQs - targetQs)) # DQN
    dlossQ /= 2
    return actions_logits, gQs, gloss, dlossA, dlossQ
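
To make the loss terms above easier to inspect, here is a NumPy sketch of the formula that the TensorFlow documentation gives for `tf.nn.sigmoid_cross_entropy_with_logits`, namely `max(x, 0) - x * z + log(1 + exp(-|x|))` for logits `x` and labels `z`. The helper name `sigmoid_cross_entropy` is an assumption for illustration. The sketch also shows that the value is non-negative for labels in `[0, 1]` but can become negative for labels outside that range, which is relevant here because the actions fed as labels lie in `[-1, 1]`.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    return np.maximum(x, 0.0) - x * z + np.log1p(np.exp(-np.abs(x)))

# Labels inside [0, 1]: the loss is non-negative.
print(sigmoid_cross_entropy([2.0, -2.0], [1.0, 0.0]))

# A label of -1 (a legal action value in this environment): the loss is negative.
print(sigmoid_cross_entropy([-5.0], [-1.0]))
```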

In [14]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_lossA, d_lossQ, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(g_loss, var_list=g_vars)
        d_optA = tf.train.AdamOptimizer(d_learning_rate).minimize(d_lossA, var_list=d_vars)
        d_optQ = tf.train.AdamOptimizer(d_learning_rate).minimize(d_lossQ, var_list=d_vars)

    return g_opt, d_optA, d_optQ

In [15]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.rates = model_input(state_size=state_size, 
                                                                           action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_lossA, self.d_lossQ = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, rates=self.rates) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_optA, self.d_optQ = model_opt(g_loss=self.g_loss, 
                                                         d_lossA=self.d_lossA, 
                                                         d_lossQ=self.d_lossQ, 
                                                         g_learning_rate=g_learning_rate, 
                                                         d_learning_rate=d_learning_rate)

In [16]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size) # data batch
        self.rates = deque(maxlen=max_size) # rates
#     def sample(self, batch_size):
#         idx = np.random.choice(np.arange(len(self.buffer)), # ==  self.rates
#                                size=batch_size, 
#                                replace=False)
#         return [self.buffer[ii] for ii in idx], [self.rates[ii] for ii in idx]

In [17]:
# print('state size:{}'.format(states.shape), 
#       'actions:{}'.format(actions.shape)) 
# print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

In [18]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]
env_info.vector_observations.shape, env_info.previous_vector_actions.shape, \
brain.vector_action_space_size, brain.number_visual_observations, \
brain.vector_action_space_size, brain.vector_observation_space_size

((1, 33), (1, 4), 4, 0, 4, 33)

In [19]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each Q-network hidden layer
g_learning_rate = 1e-4         # generator learning rate
d_learning_rate = 1e-4         # discriminator learning rate

# Memory parameters
memory_size = int(1e5)            # memory capacity
batch_size = int(1e4)             # size of the memory slice sampled per update (one episode is about int(1e3) steps)
gamma = 0.99                   # future reward discount
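
The three exploration parameters above define an exponentially decaying exploration probability, which the training loop computes as `explore_stop + (explore_start - explore_stop) * exp(-decay_rate * total_step)`. The sketch below evaluates that schedule at a few step counts; the function name `explore_probability` is a hypothetical helper.

```python
import math

explore_start = 1.0     # exploration probability at start
explore_stop = 0.01     # minimum exploration probability
decay_rate = 0.0001     # exponential decay rate

def explore_probability(total_step):
    """Probability of taking a random action after `total_step` environment steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * total_step)

for step in (0, 1001, 10_000, 100_000):
    print(step, round(explore_probability(step), 4))
```

With episodes of 1001 steps, the value after the first episode is about `0.9057`, and the probability is effectively at its floor of `0.01` after roughly 100 episodes.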

In [21]:
# Reset/init the graph/session
tf.reset_default_graph()          # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [22]:
# Initializing the memory buffer
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
total_reward = 0
num_step = 0
for _ in range(memory_size):
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    action = np.clip(action, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send all actions to the environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    if reward > 0: print('reward >>>>>> 0', reward)
    if reward < 0: print('reward <<<<< 0', reward)
    done = env_info.local_done[0]                        # see if episode finished
    memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
    memory.rates.append(-1) # empty
    num_step += 1 # memory incremented
    total_reward += reward
    state = next_state
    if done:                                          # rate the episode and reset once it has finished
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)
        rate = total_reward/30
        total_reward = 0 # reset
        for idx in range(num_step): # episode length
            if memory.rates[-1-idx] == -1:
                memory.rates[-1-idx] = rate
        num_step = 0 # reset

reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.00999

reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.009999999776482582
reward >>>>>> 0 0.029999999329447746
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.029999999329447746
reward >>>>>> 0 0.009999999776482582
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.03999999910593033
reward >>>>>> 0 0.019999999552965164
reward >>>>>> 0 0.0399
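
The rating step in the buffer-filling cell above works by appending a `-1` placeholder for every step and then, when the episode ends, overwriting the placeholders with `total_reward / 30`. The following is a self-contained sketch of that back-filling logic; the function `rate_last_episode` and the toy numbers are assumptions for illustration.

```python
from collections import deque

def rate_last_episode(rates, episode_length, total_reward, target=30.0):
    """Overwrite the -1 placeholders of the most recent episode with its rating."""
    rate = total_reward / target
    for offset in range(1, episode_length + 1):
        if rates[-offset] == -1:      # only touch steps that are still unrated
            rates[-offset] = rate
    return rate

rates = deque(maxlen=8)
rates.extend([0.5, 0.5, 0.5])         # an earlier, already-rated episode
rates.extend([-1, -1, -1, -1])        # four unrated steps of the current episode
rate_last_episode(rates, episode_length=4, total_reward=1.5)
print(list(rates))                    # [0.5, 0.5, 0.5, 0.05, 0.05, 0.05, 0.05]
```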

In [23]:
batch = memory.buffer
percentage = 0.9
# statesL, actionsL, next_statesL, rewardsL, donesL, ratesL = [], [], [], [], [], []
# for idx in range(memory_size// batch_size):
idx_arr = np.arange(memory_size// batch_size)
idx = np.random.choice(idx_arr)
states = np.array([each[0] for each in batch])[idx*batch_size:(idx+1)*batch_size]
actions = np.array([each[1] for each in batch])[idx*batch_size:(idx+1)*batch_size]
print(actions.dtype)
next_states = np.array([each[2] for each in batch])[idx*batch_size:(idx+1)*batch_size]
rewards = np.array([each[3] for each in batch])[idx*batch_size:(idx+1)*batch_size]
dones = np.array([each[4] for each in batch])[idx*batch_size:(idx+1)*batch_size]
rates = np.array(memory.rates)[idx*batch_size:(idx+1)*batch_size]
print(states.shape, actions.shape, next_states.shape, rewards.shape, dones.shape, rates.shape)
states = states[rates >= (np.max(rates)*percentage)]
actions = actions[rates >= (np.max(rates)*percentage)]
next_states = next_states[rates >= (np.max(rates)*percentage)]
rewards = rewards[rates >= (np.max(rates)*percentage)]
dones = dones[rates >= (np.max(rates)*percentage)]
rates = rates[rates >= (np.max(rates)*percentage)]
print(states.shape, actions.shape, next_states.shape, rewards.shape, dones.shape, rates.shape)
# statesL.append(states)
# actionsL.append(actions)
# next_statesL.append(next_states)
# rewardsL.append(rewards)
# donesL.append(dones)
# ratesL.append(rates)

float64
(10000, 33) (10000, 4) (10000, 33) (10000,) (10000,) (10000,)
(1001, 33) (1001, 4) (1001, 33) (1001,) (1001,) (1001,)
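
The filtering in the cell above keeps only the transitions whose episode rating is at least 90% of the best rating in the sampled slice. The same selection can be written with a single boolean mask that is computed once and applied to every array. This is a sketch on toy data; `keep_top_rated` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def keep_top_rated(rates, percentage, *arrays):
    """Keep rows whose rate is at least `percentage` of the best rate in the batch."""
    rates = np.asarray(rates)
    mask = rates >= np.max(rates) * percentage   # computed once, reused for every array
    return [np.asarray(a)[mask] for a in arrays] + [rates[mask]]

rates = np.array([0.00, 0.05, 0.10, 0.095])
rewards = np.array([0.0, 0.01, 0.04, 0.03])
states = np.arange(8).reshape(4, 2)

states_kept, rewards_kept, rates_kept = keep_top_rated(rates, 0.9, states, rewards)
print(states_kept.shape, rewards_kept, rates_kept)
```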


In [24]:
# idx_arr = np.arange(memory_size// batch_size)
# idx = np.random.choice(idx_arr)
# idx*batch_size, (idx+1)*batch_size, memory_size// batch_size, memory_size, (10+1)*batch_size, \
# idx_arr

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list = [] # goal
rewards_list, gloss_list, dloss_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    idx_arr = np.arange(memory_size// batch_size) # randomness

    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0 # each episode
        gloss_batch, dlossA_batch, dlossQ_batch= [], [], []
        num_step = 0 # each episode
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1 # 1000 episode length
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p >= np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randn(num_agents, action_size) # select an action (for each agent)
                action = np.clip(action, -1, 1)                  # all actions between -1 and 1
                #print('exploRE', action.dtype, action.shape)
            else:
                action = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                #print('exploIT', action.dtype, action.shape)
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]           # send all actions to the environment
            #print(action.dtype, action.shape)
            #print(action.reshape([-1]).dtype)
            #print(action.reshape([-1]).shape)
            next_state = env_info.vector_observations[0]         # get next state (for each agent)
            reward = env_info.rewards[0]                         # get reward (for each agent)
            done = env_info.local_done[0]                        # see if episode finished
            memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
            memory.rates.append(-1) # empty
            num_step += 1 # memory added
            total_reward += reward
            state = next_state
            
            # Rating the memory
            if done:
                rate = total_reward/30 # update rate at the end/ when episode is done
                for idx in range(num_step): # episode length
                    if memory.rates[-1-idx] == -1: # double-check the landmark/marked indexes
                        memory.rates[-1-idx] = rate # rate the trajectory/data

            # Training with the maxrated minibatch
            batch = memory.buffer
            percentage = 0.9
            #for idx in range(memory_size// batch_size):
            idx = np.random.choice(idx_arr)
            states = np.array([each[0] for each in batch])[idx*batch_size:(idx+1)*batch_size]
            actions = np.array([each[1] for each in batch])[idx*batch_size:(idx+1)*batch_size]
            #print(actions.dtype,actions.shape)
            next_states = np.array([each[2] for each in batch])[idx*batch_size:(idx+1)*batch_size]
            rewards = np.array([each[3] for each in batch])[idx*batch_size:(idx+1)*batch_size]
            dones = np.array([each[4] for each in batch])[idx*batch_size:(idx+1)*batch_size]
            rates = np.array(memory.rates)[idx*batch_size:(idx+1)*batch_size]
            #print(states.shape, actions.shape, next_states.shape, rewards.shape, dones.shape, rates.shape)
            states = states[rates >= (np.max(rates)*percentage)]
            actions = actions[rates >= (np.max(rates)*percentage)]
            #print(actions.dtype,actions.shape)
            next_states = next_states[rates >= (np.max(rates)*percentage)]
            rewards = rewards[rates >= (np.max(rates)*percentage)]
            dones = dones[rates >= (np.max(rates)*percentage)]
            rates = rates[rates >= (np.max(rates)*percentage)]
            #print(states.shape, actions.shape, next_states.shape, rewards.shape, dones.shape, rates.shape)
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones) # DQN
            nextQs = nextQs_logits.reshape([-1]) * (1-dones) # DPG
            targetQs = rewards + (gamma * nextQs)
            #print(targetQs.shape, actions.shape, rates.shape, states.shape)
            #print(targetQs.dtype, actions.dtype, rates.dtype, states.dtype)
            gloss, dlossA, dlossQ, _, _, _ = sess.run([model.g_loss, model.d_lossA, model.d_lossQ, 
                                                       model.g_opt, model.d_optA, model.d_optQ],
                                                      feed_dict = {model.states: states, 
                                                                   model.actions: actions,
                                                                   model.targetQs: targetQs, 
                                                                   model.rates: rates})
            gloss_batch.append(gloss)
            dlossA_batch.append(dlossA)
            dlossQ_batch.append(dlossQ)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dlossA:{:.4f}'.format(np.mean(dlossA_batch)),
              'dlossQ:{:.4f}'.format(np.mean(dlossQ_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Plotting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        #gloss_list.append([ep, np.mean(gloss_batch)])
        #dloss_list.append([ep, np.mean(dloss_batch)])
        
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.
        if np.mean(episode_reward) >= 30:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.0000 R:0.000000 rate:0.0000 gloss:-360.5320 dlossA:0.1029 dlossQ:0.7023 exploreP:0.9057
Episode:1 meanR:0.2000 R:0.400000 rate:0.0133 gloss:-6540.2520 dlossA:0.1410 dlossQ:3.1096 exploreP:0.8204
Episode:2 meanR:0.2500 R:0.350000 rate:0.0117 gloss:-35057.0781 dlossA:0.2427 dlossQ:17.4989 exploreP:0.7432
Episode:3 meanR:0.3625 R:0.700000 rate:0.0233 gloss:-155546.0156 dlossA:0.4973 dlossQ:62.3151 exploreP:0.6734
Episode:4 meanR:0.4160 R:0.630000 rate:0.0210 gloss:94925016.0000 dlossA:1.6635 dlossQ:449.9792 exploreP:0.6102
Episode:5 meanR:0.3683 R:0.130000 rate:0.0043 gloss:102444536.0000 dlossA:0.8881 dlossQ:390.2564 exploreP:0.5530
Episode:6 meanR:0.3157 R:0.000000 rate:0.0000 gloss:236902992.0000 dlossA:1.4376 dlossQ:1050.8639 exploreP:0.5013
Episode:7 meanR:0.2762 R:0.000000 rate:0.0000 gloss:390146464.0000 dlossA:1.8746 dlossQ:2088.5886 exploreP:0.4545
Episode:8 meanR:0.2456 R:0.000000 rate:0.0000 gloss:380554112.0000 dlossA:1.8996 dlossQ:1815.8145 exploreP:0.4121
E

Episode:72 meanR:0.3627 R:0.430000 rate:0.0143 gloss:-2271539200.0000 dlossA:5.0604 dlossQ:3875.8528 exploreP:0.0107
Episode:73 meanR:0.3578 R:0.000000 rate:0.0000 gloss:-3902618624.0000 dlossA:7.0309 dlossQ:6254.6636 exploreP:0.0106
Episode:74 meanR:0.3545 R:0.110000 rate:0.0037 gloss:-1746834432.0000 dlossA:4.6028 dlossQ:3608.8792 exploreP:0.0105
Episode:75 meanR:0.3563 R:0.490000 rate:0.0163 gloss:-2199609088.0000 dlossA:6.3691 dlossQ:6253.6445 exploreP:0.0105
Episode:76 meanR:0.3539 R:0.170000 rate:0.0057 gloss:-779677568.0000 dlossA:5.4355 dlossQ:4214.3857 exploreP:0.0104
Episode:77 meanR:0.3494 R:0.000000 rate:0.0000 gloss:-5248725504.0000 dlossA:6.0075 dlossQ:5667.9585 exploreP:0.0104
Episode:78 meanR:0.3477 R:0.220000 rate:0.0073 gloss:-2173766912.0000 dlossA:7.8854 dlossQ:7706.0083 exploreP:0.0104
Episode:79 meanR:0.3439 R:0.040000 rate:0.0013 gloss:5401230848.0000 dlossA:6.3218 dlossQ:7259.2036 exploreP:0.0103
Episode:80 meanR:0.3415 R:0.150000 rate:0.0050 gloss:3746609152.00

Episode:141 meanR:0.3437 R:0.000000 rate:0.0000 gloss:-1459960020992.0000 dlossA:11.6220 dlossQ:100744.9297 exploreP:0.0100
Episode:142 meanR:0.3437 R:0.000000 rate:0.0000 gloss:-1289396551680.0000 dlossA:14.6310 dlossQ:124397.2734 exploreP:0.0100
Episode:143 meanR:0.3368 R:1.720000 rate:0.0573 gloss:-1261709426688.0000 dlossA:12.8929 dlossQ:88236.3359 exploreP:0.0100
Episode:144 meanR:0.3361 R:1.380000 rate:0.0460 gloss:-1641987178496.0000 dlossA:18.9428 dlossQ:224033.9531 exploreP:0.0100
Episode:145 meanR:0.3322 R:0.530000 rate:0.0177 gloss:-907476205568.0000 dlossA:13.2916 dlossQ:81702.0000 exploreP:0.0100
Episode:146 meanR:0.3345 R:0.230000 rate:0.0077 gloss:-1521188077568.0000 dlossA:17.1912 dlossQ:168987.5781 exploreP:0.0100
Episode:147 meanR:0.3424 R:1.390000 rate:0.0463 gloss:-2623072632832.0000 dlossA:20.1482 dlossQ:314162.3125 exploreP:0.0100
Episode:148 meanR:0.3493 R:1.030000 rate:0.0343 gloss:-489144844288.0000 dlossA:17.7789 dlossQ:132596.9531 exploreP:0.0100
Episode:149 

Episode:207 meanR:0.5868 R:0.540000 rate:0.0180 gloss:-41678371028992.0000 dlossA:40.4188 dlossQ:789435.6875 exploreP:0.0100
Episode:208 meanR:0.5970 R:1.590000 rate:0.0530 gloss:-19130937769984.0000 dlossA:60.5590 dlossQ:1448932.5000 exploreP:0.0100
Episode:209 meanR:0.6102 R:2.100000 rate:0.0700 gloss:-37359215181824.0000 dlossA:45.1508 dlossQ:772463.8750 exploreP:0.0100
Episode:210 meanR:0.6201 R:1.100000 rate:0.0367 gloss:-63519961645056.0000 dlossA:50.4826 dlossQ:1485068.1250 exploreP:0.0100
Episode:211 meanR:0.6159 R:0.480000 rate:0.0160 gloss:-68659695321088.0000 dlossA:58.6595 dlossQ:1917926.1250 exploreP:0.0100
Episode:212 meanR:0.6166 R:0.660000 rate:0.0220 gloss:-124812437487616.0000 dlossA:70.9000 dlossQ:2348899.5000 exploreP:0.0100
Episode:213 meanR:0.6305 R:1.390000 rate:0.0463 gloss:-323171391635456.0000 dlossA:66.9534 dlossQ:6394850.5000 exploreP:0.0100
Episode:214 meanR:0.6484 R:1.790000 rate:0.0597 gloss:-16533273182208.0000 dlossA:30.6029 dlossQ:315908.0625 exploreP:

Episode:272 meanR:0.9127 R:0.000000 rate:0.0000 gloss:-1175519932448768.0000 dlossA:102.0082 dlossQ:9815439.0000 exploreP:0.0100
Episode:273 meanR:0.8957 R:0.160000 rate:0.0053 gloss:-928980353941504.0000 dlossA:85.1738 dlossQ:5118488.5000 exploreP:0.0100
Episode:274 meanR:0.8905 R:0.070000 rate:0.0023 gloss:-543877077401600.0000 dlossA:87.9458 dlossQ:3970630.5000 exploreP:0.0100
Episode:275 meanR:0.8832 R:0.040000 rate:0.0013 gloss:-1371682363146240.0000 dlossA:138.5815 dlossQ:9219262.0000 exploreP:0.0100
Episode:276 meanR:0.8796 R:0.000000 rate:0.0000 gloss:-1067769269321728.0000 dlossA:75.9002 dlossQ:3085536.5000 exploreP:0.0100
Episode:277 meanR:0.8780 R:0.090000 rate:0.0030 gloss:-1599714122268672.0000 dlossA:150.9716 dlossQ:15240578.0000 exploreP:0.0100
Episode:278 meanR:0.8754 R:0.670000 rate:0.0223 gloss:-1262049430601728.0000 dlossA:85.3141 dlossQ:3987049.2500 exploreP:0.0100
Episode:279 meanR:0.8797 R:0.560000 rate:0.0187 gloss:4015576452169728.0000 dlossA:199.3711 dlossQ:182
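
The training loop above builds its regression targets as `rewards + gamma * nextQs`, where `nextQs` is the critic output for the next states multiplied by `(1 - dones)` so that terminal steps do not bootstrap. A NumPy sketch of that computation on toy values (the function name `td_targets` and the numbers are assumptions for illustration):

```python
import numpy as np

gamma = 0.99   # future reward discount, as set in the hyperparameter cell

def td_targets(rewards, next_qs, dones, gamma=gamma):
    """r + gamma * Q(s', a'), with the bootstrap term zeroed on terminal steps."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_qs = np.asarray(next_qs, dtype=np.float64).reshape(-1)   # (N, 1) -> (N,)
    dones = np.asarray(dones, dtype=np.float64)
    return rewards + gamma * next_qs * (1.0 - dones)

targets = td_targets(rewards=[0.04, 0.0, 0.01],
                     next_qs=[[1.0], [2.0], [3.0]],
                     dones=[0.0, 0.0, 1.0])
print(targets)   # the last entry equals its reward because that step is terminal
```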