# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import matplotlib.pyplot as plt

In [2]:
env = UnityEnvironment(file_name='Reacher-20.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between `-1` and `1`.
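
As a small illustration of the action format described above, the following sketch draws a batch of random actions and clips each entry to the valid range. It uses only the standard library; `NUM_AGENTS`, `ACTION_SIZE`, and `random_clipped_actions` are names chosen for this example, and the agent count of 20 is an assumption that matches the environment build loaded earlier.

```python
import random

NUM_AGENTS = 20   # assumption: the 20-agent Reacher build
ACTION_SIZE = 4   # four torque values per agent

def random_clipped_actions(num_agents, action_size, rng):
    """Draw Gaussian actions and clip every entry to [-1, 1]."""
    return [
        [max(-1.0, min(1.0, rng.gauss(0.0, 1.0))) for _ in range(action_size)]
        for _ in range(num_agents)
    ]

rng = random.Random(0)
actions = random_clipped_actions(NUM_AGENTS, ACTION_SIZE, rng)
print(len(actions), len(actions[0]))
```

Each row is one agent's action vector, which is the layout `env.step` receives later in this notebook.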

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=False)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agent(s). Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agent(s). Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action at random at each time step. A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
rewards = []
all_rewards=[]
for i in range(1,11):
    env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        dones = env_info.local_done                        # see if episode finished
        all_rewards.append(env_info.rewards[0])
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    rewards.append(scores)
    print('\rTotal score for this episode: {:.4f}, average score {:.4f}'.format(np.mean(scores),np.mean(rewards)),end='')
    if i % 20 ==0:
        print('\rTotal score for this episode: {:.4f}, average score {:.4f}'.format(np.mean(scores),np.mean(rewards)))

Total score for this episode: 0.1305, average score 0.1281

In [6]:
from collections import Counter
Counter(all_rewards)

Counter({0.0: 10000,
         0.009999999776482582: 2,
         0.03999999910593033: 7,
         0.019999999552965164: 1})

In [7]:
len(all_rewards)

10010
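
The counts above can be turned into an average per-episode score for the first agent. The sketch below is only arithmetic on the printed counts, with the float32 reward values rounded to two decimals (an assumption made for readability); `reward_counts` and `n_episodes` are names introduced here.

```python
# Reward values (rounded) and how often the first agent received them,
# copied from the Counter output above.
reward_counts = {0.0: 10000, 0.01: 2, 0.04: 7, 0.02: 1}
n_episodes = 10  # the loop above runs for i in range(1, 11)

total_steps = sum(reward_counts.values())
total_reward = sum(value * count for value, count in reward_counts.items())

steps_per_episode = total_steps / n_episodes
mean_episode_score = total_reward / n_episodes

print(total_steps, steps_per_episode)
print(round(total_reward, 2), round(mean_episode_score, 3))
```

Under the random policy, the first agent received a nonzero reward on only 10 of its 10010 steps.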

When finished, you can close the environment.

In [8]:
#env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import namedtuple, deque
import numpy as np
import random
import copy

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Using device {}'.format(device))

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        random.seed(seed)  # random.seed() returns None, so store the seed value separately
        self.seed = seed
    
    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
    
    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        
        experiences = random.sample(self.memory, k=self.batch_size)
        
        #Extract information from memory unit and return
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

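def hidden_init(layer):
    # Note: nn.Linear stores weight as (out_features, in_features), so size()[0]
    # is the layer's output width rather than its true fan-in.
    fan_in = layer.weight.data.size()[0]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)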
    
class Actor(nn.Module):
    
    def __init__(self, state_size, action_size, hidden=[400, 300]):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden[0])
        self.fc2 = nn.Linear(hidden[0], hidden[1])
        self.fc3 = nn.Linear(hidden[1], action_size)
        self.initialize()
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.tanh(self.fc3(x))  # F.tanh is deprecated in newer PyTorch releases
        return x
    
    def initialize(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc1.bias.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc2.bias.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)
        self.fc3.bias.data.uniform_(-3e-3, 3e-3)
        
class Critic1(nn.Module):
    
    def __init__(self, state_size, action_size, hidden=[400, 300]):
        super(Critic1, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden[0])
        self.fc2 = nn.Linear(hidden[0]+action_size, hidden[1])
        self.fc3 = nn.Linear(hidden[1], 1)
        self.initialize()
    
    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat([x, action], dim=1)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
    def initialize(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc1.bias.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc2.bias.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-4, 3e-4)
        self.fc3.bias.data.uniform_(-3e-4, 3e-4)
        
class Critic2(nn.Module):
    
    def __init__(self, state_size, action_size, hidden=[400, 300]):
        super(Critic2, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden[0])
        self.fc2 = nn.Linear(hidden[0], hidden[1])
        self.fc3 = nn.Linear(hidden[1]+action_size, 1)
        self.initialize()
    
    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = torch.cat([x, action], dim=1)
        x = self.fc3(x)
        return x
    
    def initialize(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc1.bias.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc2.bias.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-4, 3e-4)
        self.fc3.bias.data.uniform_(-3e-4, 3e-4)
        
class OUNoise:
    """Ornstein-Uhlenbeck process."""

    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
        """Initialize parameters and noise process."""
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        random.seed(seed)  # random.seed() returns None, so store the seed value separately
        self.seed = seed
        self.reset()

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update internal state and return it as a noise sample."""
        x = self.state
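        # zero-mean Gaussian increments; random.random() is uniform on [0, 1) and would bias the noise upward
        dx = self.theta * (self.mu - x) + self.sigma * np.array([random.gauss(0., 1.) for _ in range(len(x))])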
        self.state = x + dx
        return self.state

Using device cpu
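
The `OUNoise` class above is a discretized Ornstein-Uhlenbeck process: each sample pulls the state back toward `mu` at rate `theta` and adds a random perturbation scaled by `sigma`. The dependency-free sketch below shows that update in isolation. It assumes zero-mean Gaussian perturbations, which is a common choice but an assumption here, and `ou_step` is a name introduced for illustration.

```python
import random

def ou_step(x, mu=0.0, theta=0.15, sigma=0.2, rng=random):
    """One discrete step of an Ornstein-Uhlenbeck process for a list of values."""
    return [xi + theta * (mu - xi) + sigma * rng.gauss(0.0, 1.0) for xi in x]

# With sigma = 0 the noise term vanishes and the state decays toward mu.
state = [1.0, -2.0]
for _ in range(3):
    state = ou_step(state, sigma=0.0)
print(state)

# With sigma > 0 successive samples are correlated, which suits exploration
# in continuous control.
rng = random.Random(0)
noisy = [0.0] * 4
trajectory = []
for _ in range(5):
    noisy = ou_step(noisy, rng=rng)
    trajectory.append(noisy)
print(len(trajectory), len(trajectory[0]))
```

In the noise-free case each step multiplies the distance to `mu` by `1 - theta`, so after three steps the first entry equals `0.85 ** 3`.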


In [10]:
import torch
import torch.nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import random
from collections import deque

class DDPG_Agent:
    
    def __init__(self, env, critic, lr1=0.0001, lr2=0.001, tau=0.001, speed1=1, speed2=1,\
                 step=1, learning_time=1, batch_size=64):
        
        #Initialize environment
        brain_name = env.brain_names[0]
        brain = env.brains[brain_name]
        env_info = env.reset(train_mode=True)[brain_name]
        num_agents = len(env_info.agents)
        action_size = brain.vector_action_space_size
        states = env_info.vector_observations
        state_size = states.shape[1]
        self.env = env
        
        #Initialize some hyper parameters of agent
        self.lr1 = lr1
        self.lr2 = lr2
        self.tau = tau
        self.speed1 = speed1
        self.speed2 = speed2
        self.learning_time = learning_time
        self.state_size = state_size
        self.action_size = action_size
        self.num_agents = num_agents
        self.batch_size = batch_size
        self.gamma = 0.99
        self.step = step
        
        #Initialize agent (networks, replay buffer and noise)
        self.actor_local = Actor(self.state_size, self.action_size).to(device)
        self.actor_target = Actor(self.state_size, self.action_size).to(device)
        if critic==1:
            self.critic_local = Critic1(self.state_size, self.action_size).to(device)
            self.critic_target = Critic1(self.state_size, self.action_size).to(device)
        else:
            self.critic_local = Critic2(self.state_size, self.action_size).to(device)
            self.critic_target = Critic2(self.state_size, self.action_size).to(device)
        self.soft_update(self.actor_local, self.actor_target, 1)
        self.soft_update(self.critic_local, self.critic_target, 1)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=self.lr1)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=self.lr2)
        self.memory = ReplayBuffer(self.action_size, buffer_size=int(1e6), batch_size=self.batch_size,\
                                   seed=random.randint(1, self.batch_size))
        self.noise = OUNoise(size=self.action_size, seed=random.randint(1, self.batch_size))
        
    def act(self, state):
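        state = torch.tensor(state, dtype=torch.float).to(device)  # run on the same device as the actor
        with torch.no_grad():
            action = self.actor_local(state).cpu().numpy()         # move back to CPU before converting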
        noise = self.noise.sample()
        action += noise
        action = np.clip(action, -1, 1)
        return action
        
    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model: PyTorch model (weights will be copied from)
            target_model: PyTorch model (weights will be copied to)
            tau (float): interpolation parameter 
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
            
    def learn(self):
        
        experiences = self.memory.sample()
        states, actions, scores, next_states, dones = experiences
        
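        # TD target: r + gamma * Q_target(s', mu_target(s')); no gradient flows through the target networks
        with torch.no_grad():
            expected_rewards = scores + (1-dones)*self.gamma*self.critic_target(next_states, self.actor_target(next_states))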
        
        for _ in range(self.speed1):
            observed_rewards = self.critic_local(states, actions)
            L = F.mse_loss(expected_rewards, observed_rewards)
            self.critic_optimizer.zero_grad()
            L.backward()
            self.critic_optimizer.step()
            del L
        
        for _ in range(self.speed2):
            L = -self.critic_local(states, self.actor_local(states)).mean()
            self.actor_optimizer.zero_grad()
            L.backward()
            self.actor_optimizer.step()
            del L
        
        self.soft_update(self.actor_local, self.actor_target, self.tau)
        self.soft_update(self.critic_local, self.critic_target, self.tau)
        
            
    def train(self, n_episodes):
        rewards = []
        brain_name = self.env.brain_names[0]
        score_window = deque(maxlen=100)
        
        for i in range(1, n_episodes+1):
            episodic_reward = np.zeros(self.num_agents)
            env_info = self.env.reset(train_mode=True)[brain_name]
            states = env_info.vector_observations
            actions = self.act(states)
            t=0
            
            while True:
                env_info = self.env.step(actions)[brain_name]
                next_states = env_info.vector_observations
                dones = env_info.local_done
                scores = env_info.rewards
                episodic_reward += np.array(scores)
                rewards.append(scores)
                for state, action, score, next_state, done in zip(states, actions, scores, next_states, dones):
                    self.memory.add(state, action, score, next_state, done)
                t += 1
                
                if len(self.memory.memory)>self.batch_size:
                    if t % self.step == 0:
                        for _ in range(self.learning_time):
                            self.learn()
                            
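                if np.any(dones):                      # stop once any agent's episode has finished
                    break
                states = next_states
                actions = self.act(states)
            
            score_window.append(np.mean(episodic_reward))  # keep the last 100 episode scores
            print('\rTotal score for this episode: {:.4f}, average score {:.4f}'.format(np.mean(episodic_reward),np.mean(score_window)),end='')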
            if i % 100 == 0:
                print('\n')
                self.actor_local.cpu()
                self.critic_local.cpu()
                self.actor_target.cpu()
                self.critic_target.cpu()
                torch.save(self.actor_local.state_dict(),'./offline/offline_actor_checkpoint_{}.pth'.format(i))
                torch.save(self.critic_local.state_dict(),'./offline/offline_critic_checkpoint_{}.pth'.format(i))
                torch.save(self.actor_target.state_dict(),'./offline/offline_actor_target_checkpoint_{}.pth'.format(i))
                torch.save(self.critic_target.state_dict(),'./offline/offline_critic_target_checkpoint_{}.pth'.format(i))
                self.actor_local.to(device)
                self.critic_local.to(device)
                self.actor_target.to(device)
                self.critic_target.to(device)
        
        np.save('./offline/offline_rewards', np.array(rewards))  # np.save takes the file name first, then the array
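
The `soft_update` method above blends each target parameter toward its local counterpart using θ_target = τ*θ_local + (1 - τ)*θ_target. The sketch below applies the same formula to plain Python lists instead of PyTorch tensors; `soft_update_lists` is a hypothetical helper written for this illustration.

```python
def soft_update_lists(local_params, target_params, tau):
    """Return target parameters moved a fraction tau toward the local ones."""
    return [tau * loc + (1.0 - tau) * tgt for loc, tgt in zip(local_params, target_params)]

local = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# tau = 1 copies the local parameters exactly.
copied = soft_update_lists(local, target, tau=1.0)

# A small tau moves the target only slightly on each learning step.
nudged = soft_update_lists(local, target, tau=0.001)

print(copied)
print(nudged)
```

Calling the update with `tau=1`, as the constructor does, makes the target networks start as exact copies of the local ones; the default `tau=0.001` then lets them trail the local networks slowly during learning.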
            

# Test agent initialization

In [11]:
agent = DDPG_Agent(env=env, critic=1)
#agent.train(1000)

## Check weights

In [12]:
agent.actor_local

Actor(
  (fc1): Linear(in_features=33, out_features=400, bias=True)
  (fc2): Linear(in_features=400, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=4, bias=True)
)

In [13]:
agent.critic_local

Critic1(
  (fc1): Linear(in_features=33, out_features=400, bias=True)
  (fc2): Linear(in_features=404, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=1, bias=True)
)

In [14]:
agent.actor_local.state_dict()

OrderedDict([('fc1.weight', tensor(1.00000e-02 *
                     [[-4.5942,  0.5654,  1.6422,  ..., -0.1278,  0.4840, -4.6325],
                      [ 3.9297,  0.3001,  1.9157,  ...,  4.2065,  2.9815,  3.4300],
                      [ 2.5130,  4.0031,  3.2086,  ...,  2.6951, -4.9408,  4.2580],
                      ...,
                      [ 1.3095,  4.8109,  1.8742,  ..., -1.5036, -3.0015, -2.3885],
                      [-1.3391,  4.1515,  0.9097,  ...,  3.6223,  3.0746, -1.0450],
                      [-1.5699, -0.3754,  2.5977,  ..., -1.7433, -4.3204,  1.9830]])),
             ('fc1.bias', tensor(1.00000e-02 *
                     [-1.2436, -3.5379, -4.0252,  0.2777, -4.5988, -0.1257, -3.5645,
                      -2.0414, -4.3601,  2.8404,  3.9854,  0.2118,  0.8501, -4.2180,
                      -1.5386, -0.5928, -0.5361,  4.6110, -3.4157,  4.0238, -1.7145,
                       1.9377, -4.4953,  2.9644,  1.7301,  2.6160,  3.6687, -2.7337,
                      -2.0656,
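
The magnitudes printed above are consistent with the `hidden_init` helper. Because `nn.Linear` stores its weight with shape `(out_features, in_features)`, `layer.weight.data.size()[0]` is 400 for `fc1`, so the uniform limit is $1/\sqrt{400} = 0.05$. The sketch below redoes that arithmetic with the standard library and checks a few of the printed `fc1.weight` entries against it; `uniform_limit` is a name introduced here.

```python
import math

def uniform_limit(n):
    """Limit used by hidden_init: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)

lim_fc1 = uniform_limit(400)  # size()[0] of fc1.weight, i.e. its output width

# A few fc1.weight entries copied from the output above (printed there as multiples of 1e-2).
printed = [-4.5942e-2, 0.5654e-2, 4.8109e-2, -4.9408e-2, 4.2580e-2]

print(lim_fc1)
print(all(abs(w) <= lim_fc1 for w in printed))
```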

In [15]:
agent.actor_target.state_dict()

OrderedDict([('fc1.weight', tensor(1.00000e-02 *
                     [[-4.5942,  0.5654,  1.6422,  ..., -0.1278,  0.4840, -4.6325],
                      [ 3.9297,  0.3001,  1.9157,  ...,  4.2065,  2.9815,  3.4300],
                      [ 2.5130,  4.0031,  3.2086,  ...,  2.6951, -4.9408,  4.2580],
                      ...,
                      [ 1.3095,  4.8109,  1.8742,  ..., -1.5036, -3.0015, -2.3885],
                      [-1.3391,  4.1515,  0.9097,  ...,  3.6223,  3.0746, -1.0450],
                      [-1.5699, -0.3754,  2.5977,  ..., -1.7433, -4.3204,  1.9830]])),
             ('fc1.bias', tensor(1.00000e-02 *
                     [-1.2436, -3.5379, -4.0252,  0.2777, -4.5988, -0.1257, -3.5645,
                      -2.0414, -4.3601,  2.8404,  3.9854,  0.2118,  0.8501, -4.2180,
                      -1.5386, -0.5928, -0.5361,  4.6110, -3.4157,  4.0238, -1.7145,
                       1.9377, -4.4953,  2.9644,  1.7301,  2.6160,  3.6687, -2.7337,
                      -2.0656,

In [16]:
agent.critic_local.state_dict()

OrderedDict([('fc1.weight',
              tensor([[-3.8205e-02, -3.7120e-02,  3.7576e-02,  ..., -1.8511e-02,
                        1.6317e-02, -6.6376e-03],
                      [-2.7279e-02, -4.5264e-02, -4.0304e-02,  ...,  5.1037e-03,
                        9.3699e-05,  3.0535e-02],
                      [ 4.7254e-02,  8.9614e-03, -3.3254e-02,  ...,  4.4166e-02,
                        3.7473e-02, -8.4104e-03],
                      ...,
                      [-1.6511e-02, -1.0751e-02, -3.1465e-02,  ...,  3.1292e-02,
                        2.1713e-02,  3.3477e-02],
                      [-2.5592e-02, -7.0164e-03, -7.4934e-03,  ...,  4.0416e-02,
                        1.3999e-02,  1.9248e-02],
                      [-2.1481e-02, -1.2877e-02, -4.8572e-02,  ..., -3.8784e-02,
                       -5.9247e-03, -4.8265e-02]])),
             ('fc1.bias', tensor(1.00000e-02 *
                     [ 2.4062, -4.2299,  3.5407,  2.6798,  4.1978,  3.9811,  1.8449,
                      -2

In [17]:
agent.critic_target.state_dict()

OrderedDict([('fc1.weight',
              tensor([[-3.8205e-02, -3.7120e-02,  3.7576e-02,  ..., -1.8511e-02,
                        1.6317e-02, -6.6376e-03],
                      [-2.7279e-02, -4.5264e-02, -4.0304e-02,  ...,  5.1037e-03,
                        9.3699e-05,  3.0535e-02],
                      [ 4.7254e-02,  8.9614e-03, -3.3254e-02,  ...,  4.4166e-02,
                        3.7473e-02, -8.4104e-03],
                      ...,
                      [-1.6511e-02, -1.0751e-02, -3.1465e-02,  ...,  3.1292e-02,
                        2.1713e-02,  3.3477e-02],
                      [-2.5592e-02, -7.0164e-03, -7.4934e-03,  ...,  4.0416e-02,
                        1.3999e-02,  1.9248e-02],
                      [-2.1481e-02, -1.2877e-02, -4.8572e-02,  ..., -3.8784e-02,
                       -5.9247e-03, -4.8265e-02]])),
             ('fc1.bias', tensor(1.00000e-02 *
                     [ 2.4062, -4.2299,  3.5407,  2.6798,  4.1978,  3.9811,  1.8449,
                      -2

## Check history recording

In the following cells I test whether experiences are saved into the `ReplayBuffer` correctly. For agent $k$ at time $t$, the state is a 33-dimensional vector whose entries all equal $k+0.01t$. The action takes the same value but has dimension 4. The reward is the scalar $10k+0.01(t+1)$, because `t` is incremented before the reward is computed, and the next state is a 33-dimensional vector whose entries all equal $k+0.01(t+1)$. Here $1\le k\le 20$ and $1\le t\le 100$.
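
To make the expected buffer contents concrete, the sketch below reproduces the scheme in plain Python. It follows the loop in the next cell, where `t` is incremented before the reward is computed, so the reward stored with the state at time $t$ uses $t+1$. `expected_experience` is a hypothetical helper, and 33 and 4 are the state and action sizes reported earlier.

```python
def expected_experience(k, t, state_size=33, action_size=4):
    """Synthetic experience for agent k (1-based) whose state is taken at time t."""
    state = [k + 0.01 * t] * state_size
    action = [k + 0.01 * t] * action_size
    reward = 10 * k + 0.01 * (t + 1)
    next_state = [k + 0.01 * (t + 1)] * state_size
    return state, action, reward, next_state

state, action, reward, next_state = expected_experience(k=1, t=1)
print(round(state[0], 2), round(action[0], 2), round(reward, 2), round(next_state[0], 2))
```

For `k=1, t=1` this gives a state value of 1.01, a reward of 10.02, and a next-state value of 1.02, which is what the first stored experience should contain.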

In [18]:
t = 1
env_info = agent.env.reset(train_mode=True)[brain_name]
states = np.cumsum(np.ones((agent.num_agents, agent.state_size)), axis=0)+np.ones((agent.num_agents, agent.state_size))*0.01*t
actions = np.cumsum(np.ones((agent.num_agents, agent.action_size)), axis=0)+np.ones((agent.num_agents, agent.action_size))*0.01*t
while t<=100:
    t += 1
    next_states = np.cumsum(np.ones((agent.num_agents, agent.state_size)), axis=0)+np.ones((agent.num_agents, agent.state_size))*0.01*t
    dones = env_info.local_done
    scores = np.ones(agent.num_agents)*0.01*t+np.cumsum(np.ones(agent.num_agents))*10
    for state, action, score, next_state, done in zip(states, actions, scores, next_states, dones):
        agent.memory.add(state, action, score, next_state, done)
    if len(agent.memory.memory)>agent.batch_size:
        if t % agent.step == 0:
            for _ in range(agent.learning_time):
                pass  # learning is skipped here; only the storage path is exercised
    states = next_states
    actions = np.cumsum(np.ones((agent.num_agents, agent.action_size)), axis=0)+np.ones((agent.num_agents, agent.action_size))*0.01*t

In [19]:
len(agent.memory)

2000
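
The count of 2000 is 100 time steps times 20 agents, far below the buffer's capacity of `int(1e6)`, so nothing has been evicted yet. The sketch below shows, on a deliberately tiny capacity, how a `deque` with `maxlen` drops the oldest entries and how `random.sample` draws a batch. `TinyBuffer` is a stand-in written for this example, not the notebook's `ReplayBuffer`.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class TinyBuffer:
    """Fixed-size FIFO store with uniform random sampling."""

    def __init__(self, capacity, batch_size, seed=0):
        self.memory = deque(maxlen=capacity)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, *fields):
        self.memory.append(Experience(*fields))

    def sample(self):
        # copy to a list so sampling works on a plain sequence
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)

buf = TinyBuffer(capacity=5, batch_size=3)
for step in range(8):
    buf.add(step, 0.0, 0.1 * step, step + 1, False)

print(len(buf), buf.memory[0].state)
batch = buf.sample()
print(len(batch))
```

After eight insertions into a buffer of capacity five, only the experiences from steps 3 to 7 remain, and every sampled batch is drawn from those.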

In [20]:
agent.memory.memory

deque([Experience(state=array([1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01,
              1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01,
              1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01]), action=array([1.01, 1.01, 1.01, 1.01]), reward=10.02, next_state=array([1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02,
              1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02,
              1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02]), done=False),
       Experience(state=array([2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01,
              2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01,
              2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01, 2.01]), action=array([2.01, 2.01, 2.01, 2.01]), reward=20.02, next_state=array([2.02, 2.02, 2.02, 2.02, 2.02, 2.02, 2.02, 2.02, 2.02, 2.02, 2.02,
              2.02, 2.02, 2.02,

## Test learning process

In [21]:
experiences = agent.memory.sample()
states, actions, scores, next_states, dones = experiences

In [22]:
states[:5]

tensor([[  6.0200,   6.0200,   6.0200,   6.0200,   6.0200,   6.0200,
           6.0200,   6.0200,   6.0200,   6.0200,   6.0200,   6.0200,
           6.0200,   6.0200,   6.0200,   6.0200,   6.0200,   6.0200,
           6.0200,   6.0200,   6.0200,   6.0200,   6.0200,   6.0200,
           6.0200,   6.0200,   6.0200,   6.0200,   6.0200,   6.0200,
           6.0200,   6.0200,   6.0200],
        [  2.4900,   2.4900,   2.4900,   2.4900,   2.4900,   2.4900,
           2.4900,   2.4900,   2.4900,   2.4900,   2.4900,   2.4900,
           2.4900,   2.4900,   2.4900,   2.4900,   2.4900,   2.4900,
           2.4900,   2.4900,   2.4900,   2.4900,   2.4900,   2.4900,
           2.4900,   2.4900,   2.4900,   2.4900,   2.4900,   2.4900,
           2.4900,   2.4900,   2.4900],
        [ 11.1200,  11.1200,  11.1200,  11.1200,  11.1200,  11.1200,
          11.1200,  11.1200,  11.1200,  11.1200,  11.1200,  11.1200,
          11.1200,  11.1200,  11.1200,  11.1200,  11.1200,  11.1200,
          11.1200,  11.

In [23]:
actions[:5]

tensor([[  6.0200,   6.0200,   6.0200,   6.0200],
        [  2.4900,   2.4900,   2.4900,   2.4900],
        [ 11.1200,  11.1200,  11.1200,  11.1200],
        [  5.7900,   5.7900,   5.7900,   5.7900],
        [  5.4100,   5.4100,   5.4100,   5.4100]])

In [24]:
scores[:5]

tensor([[  60.0300],
        [  20.5000],
        [ 110.1300],
        [  50.8000],
        [  50.4200]])

In [25]:
next_states[:5]

tensor([[  6.0300,   6.0300,   6.0300,   6.0300,   6.0300,   6.0300,
           6.0300,   6.0300,   6.0300,   6.0300,   6.0300,   6.0300,
           6.0300,   6.0300,   6.0300,   6.0300,   6.0300,   6.0300,
           6.0300,   6.0300,   6.0300,   6.0300,   6.0300,   6.0300,
           6.0300,   6.0300,   6.0300,   6.0300,   6.0300,   6.0300,
           6.0300,   6.0300,   6.0300],
        [  2.5000,   2.5000,   2.5000,   2.5000,   2.5000,   2.5000,
           2.5000,   2.5000,   2.5000,   2.5000,   2.5000,   2.5000,
           2.5000,   2.5000,   2.5000,   2.5000,   2.5000,   2.5000,
           2.5000,   2.5000,   2.5000,   2.5000,   2.5000,   2.5000,
           2.5000,   2.5000,   2.5000,   2.5000,   2.5000,   2.5000,
           2.5000,   2.5000,   2.5000],
        [ 11.1300,  11.1300,  11.1300,  11.1300,  11.1300,  11.1300,
          11.1300,  11.1300,  11.1300,  11.1300,  11.1300,  11.1300,
          11.1300,  11.1300,  11.1300,  11.1300,  11.1300,  11.1300,
          11.1300,  11.

In [26]:
dones[:5]

tensor([[ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.]])

In [27]:
expected_rewards = scores + (1-dones)*agent.gamma*agent.critic_target(next_states, agent.actor_target(states))
observed_rewards = agent.critic_local(states, actions)
L = F.mse_loss(expected_rewards, observed_rewards)
L

tensor(14789.6426)
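
A critic loss of this size is plausible for the synthetic data. The target has the form $r + \gamma(1-\text{done})\,Q_{\text{target}}$, and the critics' last layer is initialized in the range ±3e-4, so the loss is roughly the mean squared reward. The sketch below works through that arithmetic in plain Python under the assumption that both critics output exactly 0. `td_target` and `mse` are names introduced here, and the rewards are the five sampled scores printed earlier.

```python
GAMMA = 0.99  # discount factor used by the agent

def td_target(reward, done, q_next, gamma=GAMMA):
    """One-step TD target: r + gamma * (1 - done) * Q_target(s', a')."""
    return reward + gamma * (1.0 - done) * q_next

def mse(predictions, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# The five sampled rewards shown earlier; all of their done flags are 0.
rewards = [60.03, 20.50, 110.13, 50.80, 50.42]

# Assumption: untrained critics output 0 for every input.
targets = [td_target(r, done=0.0, q_next=0.0) for r in rewards]
predictions = [0.0] * len(rewards)
print(round(mse(predictions, targets), 2))

# Mean squared reward over all 20 agents, keeping only the dominant 10*k part.
approx = sum((10 * k) ** 2 for k in range(1, 21)) / 20
print(approx)
```

The five samples alone give a value in the thousands, and averaging over all 20 agents gives about 1.4e4, the same order of magnitude as the loss reported above.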

In [28]:
L = -agent.critic_local(states, agent.actor_local(states)).mean()
L

tensor(1.00000e-03 *
       2.3899)