# Continuous Control

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 2.0.9 which is incompatible.


Both versions of the environment are already saved in the Workspace and can be accessed at the file paths provided below.

Please select one of the two options below for loading the environment.

In [2]:
from unityagents import UnityEnvironment
import numpy as np

# select this option to load version 1 (with a single agent) of the environment
env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

# select this option to load version 2 (with 20 agents) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')



INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
_action_size = brain.vector_action_space_size
print('Size of each action:', _action_size)

# examine the state space 
states = env_info.vector_observations
_state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], _state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   5.75471878e+00  -1.00000000e+00
   5.55726671e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` when resetting the environment.

In [5]:

env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, _action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
print(" actions: {} \n next_states: {} \n rewards: {} \n dones:{}".format(actions, next_states, rewards,dones))

def sample_action():
    actions = np.random.randn(num_agents, _action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)   
    return actions

Total score (averaged over agents) this episode: 0.0
 actions: [[-1.          1.         -0.30612727 -1.        ]] 
 next_states: [[  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
   -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
    1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   7.90150642e+00  -1.00000000e+00
    1.25147498e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
   -2.99753308e-01]] 
 rewards: [0.0] 
 dones:[True]


When finished, you can close the environment.

In [6]:
# env.close()  # left commented out: the environment is reused for training in section 5

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 
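
As a rough sketch of that last step, the snippet below saves a policy network's `state_dict` and restores it into a freshly constructed network of the same shape, which is approximately what you would do on your own machine after downloading the weights. `TinyActor` and the file name `actor_demo.policy` are stand-ins invented for this illustration; they are not part of the project code.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyActor(nn.Module):
    """Hypothetical stand-in for a trained policy network (33 -> 4, tanh output)."""

    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 16),
            nn.ReLU(),
            nn.Linear(16, action_size),
            nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)


trained = TinyActor()
path = os.path.join(tempfile.gettempdir(), "actor_demo.policy")  # hypothetical file name
torch.save(trained.state_dict(), path)

# On the local machine: rebuild the same architecture, then load the weights.
restored = TinyActor()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

state = torch.zeros(1, 33)
with torch.no_grad():
    same_output = torch.allclose(trained(state), restored(state))
```

Saving only the `state_dict` keeps the file independent of the class definition, but the architecture used when loading has to match the one used when saving.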

### 5. The Project

#### 5.1 The tools

In [7]:
from collections import namedtuple, deque
import random

import numpy as np
import torch

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed, dev):
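        """Initialize a ReplayBuffer object.
        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): seed for Python's `random` module
            dev (torch.device): device the sampled tensors are moved to
        """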
        self.dev = dev
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        random.seed(seed)  # random.seed() returns None, so keep the seed value itself
        self.seed = seed
    
    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
    
    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(self.dev)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(self.dev)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(self.dev)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(self.dev)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(self.dev)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
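
To make the buffer's behaviour concrete, here is a stripped-down sketch using only the standard library. It shows the two properties the agent relies on: the `deque` silently evicts the oldest experience once `maxlen` is reached, and `random.sample` draws a batch without replacement. `MiniBuffer` is a hypothetical, simplified stand-in for `ReplayBuffer`; it skips the tensor conversion and uses a private random generator.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class MiniBuffer:
    """Hypothetical, simplified replay buffer (no tensor conversion)."""

    def __init__(self, buffer_size, batch_size, seed):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)  # private RNG, so the global one is untouched

    def add(self, *fields):
        self.memory.append(Experience(*fields))

    def sample(self):
        # Sampling without replacement requires len(memory) >= batch_size.
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buf = MiniBuffer(buffer_size=5, batch_size=3, seed=0)
for step in range(8):  # add 8 experiences to a buffer that holds 5
    buf.add(step, 0.0, 1.0, step + 1, False)

oldest_state = buf.memory[0].state  # states 0, 1 and 2 have been evicted
batch = buf.sample()
```

Because sampling needs at least `batch_size` stored experiences, training should only start once the buffer has been filled past that point, which is what the warm-up phase in the agent takes care of.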
    

#### 5.2 The brains

In [8]:
import torch.nn as nn
import torch

class Actor(nn.Module):
    def __init__(self, input_size, output_size, layers=[400, 300]):
        super(Actor, self).__init__()
        self.layers = nn.ModuleList()
        pre_units = input_size
        for L in layers:
            self.layers.append(nn.Linear(pre_units, L))
            self.layers.append(nn.ReLU())
            pre_units = L
        self.final_linear = nn.Linear(pre_units, output_size)
        self.final_activation = nn.Tanh()
        self.reset_parameters()
        return
        
        
    def reset_parameters(self):            
        for layer in self.layers:
            if hasattr(layer,"weight"):
                nn.init.xavier_uniform_(layer.weight)
            
        nn.init.uniform_(self.final_linear.weight, -0.003, 0.003)   
        return
    
        
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        x = self.final_linear(x)
        x = self.final_activation(x)
        return x
    

class Critic(nn.Module):
    def __init__(self, state_size, act_size, output_size=1, 
                 act_layers=[200], 
                 state_layers=[400, 300], 
                 final_layers=[400, 300]):
        super(Critic, self).__init__()
        self.state_layers = nn.ModuleList()
        self.act_layers = nn.ModuleList()
        self.final_layers = nn.ModuleList()

        pre_units = state_size
        for L in state_layers:
            self.state_layers.append(nn.Linear(pre_units, L))
            self.state_layers.append(nn.LeakyReLU())
            pre_units = L
        final_state_column_size = pre_units
            
        pre_units = act_size
        for L in act_layers:
            self.act_layers.append(nn.Linear(pre_units, L))
            self.act_layers.append(nn.LeakyReLU())
            pre_units = L
        final_action_column_size = pre_units
            
        pre_units = final_state_column_size + final_action_column_size
        for L in final_layers:
            self.final_layers.append(nn.Linear(pre_units, L))
            self.final_layers.append(nn.LeakyReLU())
            pre_units = L        
            
        self.final_linear = nn.Linear(pre_units, output_size)
        self.reset_parameters()
        return
    
        
    def reset_parameters(self):
        for layer in self.state_layers:
            if hasattr(layer,"weight"):
                nn.init.xavier_uniform_(layer.weight)
            
        for layer in self.act_layers:
            if hasattr(layer,"weight"):
                nn.init.xavier_uniform_(layer.weight)
            
        for layer in self.final_layers:
            if hasattr(layer,"weight"):
                nn.init.xavier_uniform_(layer.weight)
            
        nn.init.uniform_(self.final_linear.weight, -0.003, 0.003)
        return
        
        
    def forward(self, state, action):
        x_state = state
        for layer in self.state_layers:
            x_state = layer(x_state)
            
        x_act = action
        for layer in self.act_layers:
            x_act = layer(x_act)
        
        x = torch.cat((x_state, x_act), dim=1)
        
        for layer in self.final_layers:
            x = layer(x)    
        
        x = self.final_linear(x)
        return x

act_test = Actor(input_size=_state_size, output_size=_action_size)
cri_test = Critic(state_size=_state_size, act_size=_action_size)
print("Actor DAG:\n{}".format(act_test))
print("Critic DAG:\n{}".format(cri_test))

act_f_layer = act_test.final_linear.weight.detach().numpy()
print(act_f_layer.min(), act_f_layer.max(), act_f_layer.mean())

    

Actor DAG:
Actor(
  (layers): ModuleList(
    (0): Linear(in_features=33, out_features=400, bias=True)
    (1): ReLU()
    (2): Linear(in_features=400, out_features=300, bias=True)
    (3): ReLU()
  )
  (final_linear): Linear(in_features=300, out_features=4, bias=True)
  (final_activation): Tanh()
)
Critic DAG:
Critic(
  (state_layers): ModuleList(
    (0): Linear(in_features=33, out_features=400, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=400, out_features=300, bias=True)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (act_layers): ModuleList(
    (0): Linear(in_features=4, out_features=200, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
  )
  (final_layers): ModuleList(
    (0): Linear(in_features=500, out_features=400, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=400, out_features=300, bias=True)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (final_linear): Linear(in_features=300, out_features=1, bias=True)
)

#### 5.3 The agent

In [10]:

import torch.optim as optim
import torch.nn.functional as F

class Agent():
    """
    Implements TD3 (Twin Delayed DDPG): two critics with a clipped double-Q
    target, target policy smoothing, and delayed actor/target updates.
    """
    def __init__(self, a_size, s_size, dev, 
                 GAMMA=0.995, 
                 TAU=5e-3, 
                 policy_noise=0.2, 
                 exploration_noise=0.1, 
                 noise_clip=0.5,
                 LR_CRITIC=1e-2, 
                 LR_ACTOR=1e-2, 
                 WEIGHT_DECAY=1e-5, 
                 policy_freq=2, 
                 random_seed=1234,
                 BUFFER_SIZE=int(1e6), 
                 BATCH_SIZE=128, 
                 RANDOM_WARM_UP=1e4):
        self.a_size = a_size
        self.s_size = s_size
        self.dev = dev
        self.device = dev
        self.GAMMA = GAMMA
        self.TAU = TAU
        self.RANDOM_WARM_UP = RANDOM_WARM_UP
        self.noise_clip = noise_clip
        self.exploration_noise = exploration_noise
        self.policy_noise = policy_noise
        self.policy_freq = policy_freq
        self.BATCH_SIZE = BATCH_SIZE
        self.actor_online = Actor(input_size=self.s_size, output_size=self.a_size).to(self.dev)
        self.actor_target = Actor(input_size=self.s_size, output_size=self.a_size).to(self.dev)
        self.actor_target.load_state_dict(self.actor_online.state_dict())
        self.actor_optimizer = optim.Adam(self.actor_online.parameters(), lr=LR_ACTOR)
        
        self.critic_online_1 = Critic(state_size=self.s_size, act_size=self.a_size).to(self.dev)
        self.critic_target_1 = Critic(state_size=self.s_size, act_size=self.a_size).to(self.dev)
        self.critic_target_1.load_state_dict(self.critic_online_1.state_dict())
        self.critic_1_optimizer = optim.Adam(self.critic_online_1.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
        
        self.critic_online_2 = Critic(state_size=self.s_size, act_size=self.a_size).to(self.dev)
        self.critic_target_2 = Critic(state_size=self.s_size, act_size=self.a_size).to(self.dev)
        self.critic_target_2.load_state_dict(self.critic_online_2.state_dict())
        self.critic_2_optimizer = optim.Adam(self.critic_online_2.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
        
        self.memory = ReplayBuffer(self.a_size, BUFFER_SIZE, BATCH_SIZE, random_seed, dev=self.dev)
        self.step_counter = 0
        self.steps_to_train_counter = 0
        self.skip_update_timer = 0
        self.train_iters = 0
        self.actor_updates = 0
        self.critic_1_losses = deque(maxlen=100)
        self.critic_2_losses = deque(maxlen=100)
        self.actor_losses = deque(maxlen=100)
        return
    

    def step(self, state, action, reward, next_state, done, train_every_steps):
        """Save experience in replay memory. train if required"""
        # Save experience / reward
        self.step_counter += 1
        self.memory.add(state, action, reward, next_state, done)
        
        if not self.is_warming_up():
            if self.steps_to_train_counter > 0:
                self.train(nr_iters=1)
                self.steps_to_train_counter -= 1
                self.skip_update_timer = 0
            else:
                self.skip_update_timer += 1

            if self.skip_update_timer >= train_every_steps:
                self.steps_to_train_counter = train_every_steps # // 2 # only half training
                self.skip_update_timer = 0            
        return


    def train(self, nr_iters):
        """ use random sample from buffer to learn """
        # Learn only once the warm-up phase has filled the buffer
        if not self.is_warming_up():
            for _ in range(nr_iters):
                experiences = self.memory.sample()
                self._train(experiences, self.GAMMA)    
        return
    
    def is_warming_up(self):
        return len(self.memory) < self.RANDOM_WARM_UP
    
    
    def act(self, state, add_noise=False):
        """Returns actions for given state as per current policy."""
        state = torch.from_numpy(state).float().to(self.device)
        self.actor_online.eval()
        with torch.no_grad():
            action = self.actor_online(state).cpu().data.numpy()
        self.actor_online.train()
        if add_noise:
            # add_noise implies training, so act() must not be called during warm-up
            assert not self.is_warming_up()
            noise = np.random.normal(loc=0, scale=self.exploration_noise, size=action.shape)
            action += noise
        return np.clip(action, -1, 1)
    
    
    def _train(self, experiences, gamma):
        """Update policy and value parameters using given batch of experience tuples.
        Q_targets = r + γ * (1 - done) * min_i critic_target_i(next_state, a')
        where:
            a' = clip(actor_target(next_state) + clip(noise, -noise_clip, noise_clip), -1, 1)
            actor_target(state) -> action
            critic_target_i(state, action) -> Q-value (i = 1, 2)

        Params
        ======
            experiences: tuple of batched tensors (s, a, r, s', done)
            gamma (float): discount factor
        """
        self.train_iters += 1
        if self.train_iters == 1:
            print("\nFirst training iter at step {}".format(self.step_counter))
        states, actions, rewards, next_states, dones = experiences
        # Target policy smoothing: perturb the target action with clipped Gaussian noise
        actions_next = self.actor_target(next_states)
        noise = torch.randn_like(actions_next) * self.policy_noise
        noise = torch.clamp(noise, -self.noise_clip, self.noise_clip)
        actions_next = torch.clamp(actions_next + noise, -1, 1)
        
        Q_targets_next_1 = self.critic_target_1(next_states, actions_next)
        Q_targets_next_2 = self.critic_target_2(next_states, actions_next)
        
        Q_targets_next = torch.min(Q_targets_next_1, Q_targets_next_2)
        
        # Compute Q targets for current states (y_i)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones)).detach()

        # Compute critic loss 1
        Q_expected_1 = self.critic_online_1(states, actions)
        critic_1_loss = F.mse_loss(Q_expected_1, Q_targets)
        # Minimize the loss for critic 1
        self.critic_1_optimizer.zero_grad()
        critic_1_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.critic_online_1.parameters(), 1)
        self.critic_1_optimizer.step()
        self.np_loss_1 = critic_1_loss.detach().cpu().item()

        
        # Compute critic loss 2
        Q_expected_2 = self.critic_online_2(states, actions)
        critic_2_loss = F.mse_loss(Q_expected_2, Q_targets)
        # Minimize the loss for critic 2
        self.critic_2_optimizer.zero_grad()
        critic_2_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.critic_online_2.parameters(), 1)
        self.critic_2_optimizer.step()
        self.np_loss_2 = critic_2_loss.detach().cpu().item()
        
        self.critic_1_losses.append(self.np_loss_1)
        self.critic_2_losses.append(self.np_loss_2)
        
        if (self.train_iters % self.policy_freq) == 0:
            actions_pred = self.actor_online(states)
            actor_loss = -self.critic_online_1(states, actions_pred).mean()
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()
            self.np_loss_actor = actor_loss.detach().cpu().item()
            
            self.soft_update_actor()
            self.soft_update_critics()
            
            self.actor_losses.append(self.np_loss_actor)
            self.actor_updates += 1
            
        return

    def save(self, label):
        fn = 'actor_it_{:010}_{}.policy'.format(self.train_iters, label)
        torch.save(self.actor_online.state_dict(), fn)
        return

    
    def soft_update_actor(self):
        self._soft_update(self.actor_online, self.actor_target, self.TAU)
        return

        
    def soft_update_critics(self):
        self._soft_update(self.critic_online_1, self.critic_target_1, self.TAU)
        self._soft_update(self.critic_online_2, self.critic_target_2, self.TAU)
        return
        
        
    def _soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model: PyTorch model (weights will be copied from)
            target_model: PyTorch model (weights will be copied to)
            tau (float): interpolation parameter 
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)    
        return
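
The core update rules in `_train` and `_soft_update` can be checked on toy numbers. The NumPy sketch below is an illustration only: `smooth_target_action`, `td3_target` and `soft_update` are hypothetical helper names, and the Q-values and actions are plain arrays made up for the example rather than network outputs.

```python
import numpy as np


def smooth_target_action(action, policy_noise, noise_clip, rng):
    """Target policy smoothing: add clipped Gaussian noise, keep the action in [-1, 1]."""
    noise = rng.normal(0.0, policy_noise, size=action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    return np.clip(action + noise, -1.0, 1.0)


def td3_target(reward, q1_next, q2_next, done, gamma):
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    q_next = np.minimum(q1_next, q2_next)
    return reward + gamma * q_next * (1.0 - done)


def soft_update(target, local, tau):
    """Polyak averaging: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    return tau * local + (1.0 - tau) * target


rng = np.random.RandomState(0)
a_next = np.array([[0.9, -0.95, 0.0, 0.3]])
a_smooth = smooth_target_action(a_next, policy_noise=0.2, noise_clip=0.5, rng=rng)

reward = np.array([[0.1], [0.0]])
q1 = np.array([[2.0], [5.0]])
q2 = np.array([[3.0], [4.0]])
done = np.array([[0.0], [1.0]])  # the second transition is terminal
y = td3_target(reward, q1, q2, done, gamma=0.995)
# row 1: 0.1 + 0.995 * min(2, 3) = 2.09; row 2: terminal, so the target is just the reward

theta_target = soft_update(target=np.array([1.0, 1.0]), local=np.array([0.0, 2.0]), tau=0.005)
# 0.005 * [0, 2] + 0.995 * [1, 1] = [0.995, 1.005]
```

Taking the minimum of the two critics counteracts overestimated Q-values, and a small `tau` makes the target networks trail the online networks slowly, which tends to stabilise the bootstrapped targets.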
    

#### 5.4 The loop

In [11]:
import time
import matplotlib.pyplot as plt

def training_loop(env, agent, n_episodes=300, max_t=10000, train_every_steps=10):
    print("Starting training for {} episodes...".format(n_episodes))
    scores_deque = deque(maxlen=100)
    steps_deque = deque(maxlen=100)
    scores = []
    ep_times = []
    for i_episode in range(1, n_episodes+1):
        t_start = time.time()
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        state = states[0]
        score = 0        
        for t in range(max_t):
            if agent.is_warming_up():
                action = sample_action()
            else:
                action = agent.act(state, add_noise=True)
            env_info = env.step(action)[brain_name]
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            next_state, reward, done = next_states[0], rewards[0], dones[0]
            agent.step(state, action, reward, next_state, done, train_every_steps=train_every_steps)
            state = next_state
            score += reward
            if done:
                break           
        if train_every_steps==0:
            agent.train(nr_iters=t//10)
        scores_deque.append(score)
        scores.append(score)
        steps_deque.append(t)
        t_end = time.time()
        ep_time = t_end - t_start
        ep_times.append(ep_time)
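        # np.mean of an empty deque emits a RuntimeWarning, so report nan explicitly until training starts
        _cl1 = np.mean(agent.critic_1_losses) if agent.critic_1_losses else np.nan
        _cl2 = np.mean(agent.critic_2_losses) if agent.critic_2_losses else np.nan
        _al = np.mean(agent.actor_losses) if agent.actor_losses else np.nan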
        print('\rEpisode {:>4}  Score/Avg: {:>4.1f}/{:>4.1f}  Steps: {:>4}  [μcL1/μcL2: {:>8.1e}/{:>8.1e} μaL: {:>8.1e}]  t:{:>4.1f}s    '.format(
            i_episode, score, np.mean(scores_deque), t, _cl1,_cl2, _al, ep_time), end="", flush=True)
        if np.mean(scores_deque) > 30:
            print("\nEnvironment solved at episode {}!".format(i_episode))
            agent.save('ep_{}_solved'.format(i_episode))
            break
        if i_episode % 50 == 0:
            mean_ep = np.mean(ep_times)
            # estimate the remaining time from the episodes still to run
            left_time_hrs = (n_episodes - i_episode) * mean_ep / 3600
            print('\rEpisode {:>4}  Score/Avg: {:>4.1f}/{:>4.1f}  AvStp: {:>4.0f}  [μcL1/μcL2: {:>8.1e}/{:>8.1e} μaL: {:>8.1e}]  t-left:{:>4.1f} h    '.format(
                i_episode, score, np.mean(scores_deque), np.mean(steps_deque), _cl1,_cl2, _al, left_time_hrs))
            print("  Loaded steps: {:>10}".format(agent.step_counter))
            print("  Train iters:  {:>10}".format(agent.train_iters))
            print("  Actor update: {:>10}".format(agent.actor_updates))
    return scores

from workspace_utils import active_session
 
with active_session():
    train_every_steps=10
    dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    agent = Agent(a_size=_action_size, s_size=_state_size, dev=dev)
    scores = training_loop(env=env, agent=agent, train_every_steps=train_every_steps)

    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(np.arange(1, len(scores)+1), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()



Starting training for 300 episodes...
Episode    1  Score/Avg:  0.0/ 0.0  Steps: 1000  [μcL1/μcL2:      nan/     nan μaL:      nan]  t: 1.5s    



Episode    9  Score/Avg:  0.0/ 0.2  Steps: 1000  [μcL1/μcL2:      nan/     nan μaL:      nan]  t: 1.4s    
First training iter at step 10010
Episode   12  Score/Avg:  0.0/ 0.2  Steps: 1000  [μcL1/μcL2:  6.6e-06/ 6.6e-06 μaL:  7.1e-03]  t:41.1s    

KeyboardInterrupt: 
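
The training loop stops as soon as the mean of `scores_deque` exceeds 30. One caveat is that a `deque(maxlen=100)` holding fewer than 100 scores averages over a shorter window, so a few early high-scoring episodes could trigger the stop. The sketch below, using only the standard library, shows one possible way to require a full window first. `solved_at` is a hypothetical helper, and the scores are made up for illustration.

```python
from collections import deque


def solved_at(scores, threshold=30.0, window=100):
    """Return the 1-based episode at which the mean over a full window
    first exceeds the threshold, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > threshold:
            return episode
    return None


# Made-up scores: three weak episodes followed by strong ones.
toy_scores = [0.0, 10.0, 20.0, 40.0, 40.0, 40.0]
first_solved = solved_at(toy_scores, threshold=30.0, window=3)
# windows: [0, 10, 20] -> 10.0, [10, 20, 40] -> 23.3, [20, 40, 40] -> 33.3 (episode 5)
```

A single score of 40 does not count as solved here, because the window is not yet full.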

In [None]:
print("Critic w-means: {}".format([x.detach().cpu().numpy().mean() for x in agent.critic_online_1.parameters()]))
print("Actor w-means: {}".format([x.detach().cpu().numpy().mean() for x in agent.actor_online.parameters()]))

In [None]:
len(agent.actor_losses)

In [None]:
len(agent.critic_1_losses)

In [None]:
len(agent.memory)