# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
#env = UnityEnvironment(file_name='C:\\Users\\kvjos\\udacityRL\\deep-reinforcement-learning\\p2_continuous-control\\Reacher_Windows_x86_64/Reacher.exe')
env = UnityEnvironment(file_name='Reacher.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
for i in range(1000):
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.13799999691545964


When finished, you can close the environment.

In [8]:
terms = np.array([1 if t else 0 for t in env_info.local_done])

In [9]:
all_rewards = np.zeros(20)
episode_rewards = []
for i, terminal in enumerate(terms):
                if terms[i]:
                    self.episode_rewards.append(self.all_rewards[i])
                    self.all_rewards[i] = 0

In [10]:
all_rewards

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

In [11]:
episode_rewards

[]

In [None]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [None]:
## Try just understanding the training segment of PPO


In [None]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
for i in range(1000):
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

In [64]:
rollout = []
rollout_length = 60
env_info = env.reset(train_mode=True)[brain_name]    
states = env_info.vector_observations
for _ in range(rollout_length):
    actions, log_probs, _, values = policy(states)
    env_info = env.step(actions.cpu().detach().numpy())[brain_name]
    next_states = env_info.vector_observations
    rewards = env_info.rewards
    terminals = np.array([1 if t else 0 for t in env_info.local_done])
    #self.all_rewards += rewards
            
    #for i, terminal in enumerate(terminals):
        #if terminals[i]:
            #self.episode_rewards.append(self.all_rewards[i])
            #self.all_rewards[i] = 0
    
    rollout.append([states, values.detach(), actions.detach(), log_probs.detach(), rewards, 1 - terminals])
    states = next_states

pending_value = policy(states)[-1]
rollout.append([states, pending_value, None, None, None, None])


In [65]:
for i in range(rollout_length):
    print(rollout[i][4])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

In [69]:
processed_rollout = [None] * (len(rollout) - 1)
advantages = torch.Tensor(np.zeros((20, 1)))
returns = pending_value.detach()

In [70]:
returns

tensor([[-0.2105],
        [-0.2466],
        [-0.1758],
        [-0.2399],
        [-0.1876],
        [-0.1743],
        [-0.0614],
        [-0.2716],
        [-0.6308],
        [-0.1964],
        [-0.1475],
        [-0.1580],
        [-0.1890],
        [ 0.0733],
        [-0.3741],
        [-0.0109],
        [-0.1138],
        [-0.3361],
        [-0.3716],
        [-0.2445]])

In [75]:
for i in reversed(range(len(rollout) - 1)):
    states, value, actions, log_probs, rewards, terminals = rollout[i]
#    print("Rewards " + str(i) + str(rewards))
#    print(torch.Tensor(rewards).unsqueeze(1))
     terminals = torch.Tensor(terminals).unsqueeze(1)
     rewards = torch.Tensor(rewards).unsqueeze(1)
    actions = torch.Tensor(actions)
     states = torch.Tensor(states)
        next_value = rollout[i + 1][1]
#     returns = rewards + hyperparameters['discount_rate'] * terminals * returns

#     td_error = rewards + hyperparameters['discount_rate'] * terminals * next_value.detach() - value.detach()
#     advantages = advantages * hyperparameters['tau'] * hyperparameters['discount_rate'] * terminals + td_error
#     processed_rollout[i] = [states, actions, log_probs, returns, advantages]



Rewards 59[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tensor([[ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.]])
Rewards 58[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tensor([[ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.]])
Rewards 57[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tensor([[ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
     

In [None]:
some_dist = torch.distributions.Normal(0,1)

In [None]:
    processed_rollout = [None] * (len(rollout) - 1)
        advantages = torch.Tensor(np.zeros((self.config['environment']['number_of_agents'], 1)))
        returns = pending_value.detach()
        
        states, actions, log_probs_old, returns, advantages = map(lambda x: torch.cat(x, dim=0), zip(*processed_rollout))

In [14]:
torch.randn(1)

tensor([ 0.8027])

In [28]:
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Policy(nn.Module):

    def __init__(self):
        super(Policy, self).__init__()
     
        self.size = 33
        self.fc1 = nn.Linear(self.size, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 4)
        
        self.critic_fc1 = nn.Linear(self.size,512)
        self.critic_fc2 = nn.Linear(512,128)
        self.critic_fc3 = nn.Linear(128,1)
        self.normalDistParams = torch.ones((1,4),device=device)
        
    def forward(self, x,sampled_actions=None):
        a  = torch.Tensor(x)
        a = F.relu(self.fc1(a))
        a = F.relu(self.fc2(a))
        a = F.tanh(self.fc3(a))
        
        v = torch.Tensor(x)
        v = F.relu(self.critic_fc1(v))
        v = F.relu(self.critic_fc2(v))
        v = self.critic_fc3(v)
        
       
        #x is now the mean of a normal distribution from which we
        #sample the actual action values
        prob_dists = torch.distributions.Normal(a,self.normalDistParams)
        if sampled_actions is None:
            sampled_actions = prob_dists.sample()
        action_probabilities = prob_dists.log_prob(sampled_actions) #Prob of each individual action
        summed_action_probabilities = torch.sum(action_probabilities,dim=-1,keepdim=True)
        return sampled_actions,summed_action_probabilities,x,v
        


# run your own policy!
policy=Policy().to(device)
#policy=pong_utils.Policy().to(device)

# we use the adam optimizer with learning rate 2e-4
# optim.SGD is also possible
import torch.optim as optim
optimizer = optim.Adam(policy.parameters(), lr=1e-4)

In [29]:
policy

Policy(
  (fc1): Linear(in_features=33, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=512, bias=True)
  (fc3): Linear(in_features=512, out_features=4, bias=True)
  (critic_fc1): Linear(in_features=33, out_features=512, bias=True)
  (critic_fc2): Linear(in_features=512, out_features=128, bias=True)
  (critic_fc3): Linear(in_features=128, out_features=1, bias=True)
)

In [30]:
test_obs = torch.tensor(env_info.vector_observations,dtype=torch.float, device=device)
test_actions,test_probs,dist_means,test_val = policy(test_obs)

In [31]:
test_val

tensor([[-0.4196],
        [-0.2368],
        [ 0.0104],
        [-0.2756],
        [-0.1634],
        [-0.3744],
        [-0.4036],
        [-0.3954],
        [-0.2302],
        [-0.2640],
        [-0.0769],
        [-0.4452],
        [-0.3922],
        [ 0.0270],
        [-0.2938],
        [-0.2137],
        [-0.0689],
        [-0.2956],
        [-0.0984],
        [ 0.0443]])

In [11]:
test_probs

tensor([[-8.7777],
        [-4.2433],
        [-4.0453],
        [-6.0767],
        [-4.3547],
        [-9.4517],
        [-4.5175],
        [-5.3599],
        [-6.1033],
        [-4.9137],
        [-7.6580],
        [-5.1190],
        [-5.2781],
        [-4.1189],
        [-5.9885],
        [-6.4951],
        [-5.6233],
        [-7.9179],
        [-5.3300],
        [-7.5236]], device='cuda:0', grad_fn=<SumBackward1>)

In [14]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
for i in range(1000):
    #actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    #actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    states_tensor = torch.tensor(states,dtype=torch.float,device=device)
    actions,act_probs,act_means = policy(states_tensor)
    actions = actions.squeeze().cpu().detach().numpy()
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.17549999607726932


In [12]:
def collect_trajectories(env,policy,tmax=200,nrand=5):
    # number of parallel instances
    #env_info = env.reset(train_mode=False)[brain_name]
    #n=len(env_info.agents)

    #initialize returning lists and start the game!
    state_list=[]
    reward_list=[]
    prob_list=[]
    action_list=[]

    # start all parallel agents
    #envs.step([1]*n)
    
    # perform nrand random steps
    for _ in range(nrand):
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)   
        env_info = env.step(actions)[brain_name]
    
    for t in range(tmax):
        states_tensor = torch.tensor(env_info.vector_observations,dtype=torch.float,device=device)
        actions,probs,action_means = policy(states_tensor)
        
        actions = actions.squeeze().cpu().detach().numpy()
        probs = probs.squeeze().cpu().detach().numpy()
        action_means = action_means.squeeze().cpu().detach().numpy()
        
        env_info = env.step(actions)[brain_name]           # send all actions to tne environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        states = next_states                               # roll over states to next time step
        
        state_list.append(states_tensor)
        reward_list.append(rewards)
        prob_list.append(probs)
        action_list.append(actions)
        
        if np.any(dones):                                  # exit loop if episode finished
            break

    return prob_list, state_list, \
        action_list, reward_list

In [13]:
test_probs, test_states, test_actions, test_rewards = \
        collect_trajectories(env, policy, tmax=37)

In [14]:
def calculate_normalized_rewards(rewards,discount=0.995):
    discounts = discount**np.arange(len(rewards))
    discounts_reshaped = discounts[:,np.newaxis]
    rewards = np.asarray(rewards)
    total_rewards = rewards*discounts_reshaped
    #print(rewards)
    #print(total_rewards)
    future_rewards = np.flipud(np.cumsum(np.flipud(total_rewards),axis=0))
    #print("Future rewards length = " + str(len(future_rewards)))
    
    rewards_mean = np.mean(future_rewards,axis=1)
    #print(rewards_mean)
    rewards_std = np.std(future_rewards,axis=1) + 1e-10
    rewards_normalized = (future_rewards-rewards_mean[:,np.newaxis])/(rewards_std[:,np.newaxis])
    #print(rewards_normalized)
    #print("Rewards normalize dlength : " + str(len(rewards_normalized)) + " " + str(len(rewards_normalized[0])))
    return rewards_normalized

In [15]:
print(test_rewards)
calculate_normalized_rewards(test_rewards)

[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

array([[-0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573,  4.35889894,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573],
       [-0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573,  4.35889894,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573],
       [-0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573,  4.35889894,
        -0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573],
       [-0.22941573, -0.22941573, -0.22941573, -0.22941573, -0.22941573,
        -0.22941573, -0.22941573, -0.22941573, -

In [16]:
def clipped_surrogate(policy, old_probs, states, actions, rewards,
                      discount = 0.995, epsilon=0.1, beta=0.01):

    rewards_normalized = calculate_normalized_rewards(rewards)
    
    actions = torch.tensor(actions, dtype=torch.float, device=device)
    old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
    rewards = torch.tensor(rewards_normalized, dtype=torch.float, device=device)
     
    
    # convert states to policy (or probability)
    _,new_probs,_ = policy(torch.stack(states),actions)
    new_probs = new_probs.squeeze()

    #print(new_probs.shape)
    #print(old_probs.shape)
    
    
    #new_probs = torch.where(actions == pong_utils.RIGHT, new_probs, 1.0-new_probs)
    
    ratio = new_probs/old_probs
    ratio = ratio.exp()
    #print(ratio)
    clipped_ratio = torch.clamp(ratio, 1-epsilon, 1+epsilon)
    clipped_surrogate = torch.min(ratio*rewards, clipped_ratio*rewards)
    #print(clipped_surrogate)
    #print(old_probs)
    #print(torch.log(old_probs+1.e-10))
    # include a regularization term
    # this steers new_policy towards 0.5
    # prevents policy to become exactly 0 or 1 helps exploration
    # add in 1.e-10 to avoid log(0) which gives nan
    entropy = -(new_probs.exp()*torch.log(old_probs.exp()+1.e-10)+ (1.0-new_probs.exp())*torch.log(1.0-old_probs.exp()+1.e-10))
    #print(entropy)
    #return torch.mean(clipped_surrogate + beta*entropy)
    return torch.mean(clipped_surrogate)
    #return torch.mean(clipped_surrogate + beta*entropy)

In [17]:
policy=Policy().to(device)
#policy=pong_utils.Policy().to(device)

# we use the adam optimizer with learning rate 2e-4
# optim.SGD is also possible
import torch.optim as optim
optimizer = optim.Adam(policy.parameters(), lr=5e-4)

In [18]:
clipped_surrogate(policy,test_probs,test_states,test_actions,test_rewards)

tensor(-0.2559, device='cuda:0', grad_fn=<MeanBackward1>)

In [19]:
import numpy as np
# keep track of how long training takes
# WARNING: running through all 800 episodes will take 30-45 minutes

# training loop max iterations
episode = 300

# widget bar to display progress
!pip install progressbar
import progressbar as pb
widget = ['training loop: ', pb.Percentage(), ' ', 
          pb.Bar(), ' ', pb.ETA() ]
timer = pb.ProgressBar(widgets=widget, maxval=episode).start()

discount_rate = .99
epsilon = 0.2
beta = .01
tmax = 1000
SGD_epoch = 4

# keep track of progress
mean_rewards = []

for e in range(episode):
    env_info = env.reset(train_mode=True)[brain_name]   
    # collect trajectories
    old_probs, states, actions, rewards = \
        collect_trajectories(env, policy, tmax=tmax)
        
    total_rewards = np.sum(rewards, axis=0)


    # gradient ascent step
    for _ in range(SGD_epoch):
        
        # uncomment to utilize your own clipped function!
        # L = -clipped_surrogate(policy, old_probs, states, actions, rewards, epsilon=epsilon, beta=beta)

        L = -clipped_surrogate(policy, old_probs, states, actions, rewards,
                                          epsilon=epsilon, beta=beta)
        print(L)
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
        del L
    
    # the clipping parameter reduces as time goes on
    epsilon*=.999
    
    # the regulation term also reduces
    # this reduces exploration in later runs
    beta*=.995
    
    # get the average reward of the parallel environments
    mean_rewards.append(np.mean(total_rewards))
    
    # display some progress every 20 iterations
    if (e+1)%10 ==0 :
        print("Episode: {0:d}, score: {1:f}".format(e+1,np.mean(total_rewards)))
        print(total_rewards)
        
    # update progress widget bar
    timer.update(e+1)
    
timer.finish()

mkl-random 1.0.1 requires cython, which is not installed.
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
training loop:   0% |                                          | ETA:  --:--:--



training loop:   0% |                                           | ETA:  0:18:03

tensor(0.4070, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4217, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4109, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4173, device='cuda:0', grad_fn=<NegBackward>)


training loop:   0% |                                           | ETA:  0:17:49

tensor(0.4491, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4505, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4508, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4494, device='cuda:0', grad_fn=<NegBackward>)


training loop:   1% |                                           | ETA:  0:17:45

tensor(0.4551, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4555, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4558, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4553, device='cuda:0', grad_fn=<NegBackward>)


training loop:   1% |                                           | ETA:  0:17:39

tensor(0.4493, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4495, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4495, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4492, device='cuda:0', grad_fn=<NegBackward>)


training loop:   1% |                                           | ETA:  0:17:34

tensor(0.3975, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3975, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3975, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3974, device='cuda:0', grad_fn=<NegBackward>)


training loop:   2% |                                           | ETA:  0:17:31

tensor(0.4836, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4835, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4834, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4832, device='cuda:0', grad_fn=<NegBackward>)


training loop:   2% |#                                          | ETA:  0:17:28

tensor(0.3873, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3872, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3872, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3871, device='cuda:0', grad_fn=<NegBackward>)


training loop:   2% |#                                          | ETA:  0:17:24

tensor(0.3947, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3946, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3945, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3944, device='cuda:0', grad_fn=<NegBackward>)


training loop:   3% |#                                          | ETA:  0:17:21

tensor(0.4237, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4237, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4236, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4235, device='cuda:0', grad_fn=<NegBackward>)


training loop:   3% |#                                          | ETA:  0:17:17

tensor(0.2539, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2539, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2538, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2537, device='cuda:0', grad_fn=<NegBackward>)
Episode: 10, score: 0.097000
[0.         0.         0.         0.         0.         0.84999998
 0.         0.         0.         0.17       0.         0.
 0.15       0.         0.51999999 0.         0.         0.24999999
 0.         0.        ]


training loop:   3% |#                                          | ETA:  0:17:14

tensor(0.1468, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1468, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1468, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1467, device='cuda:0', grad_fn=<NegBackward>)


training loop:   4% |#                                          | ETA:  0:17:11

tensor(0.4454, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4453, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4451, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4449, device='cuda:0', grad_fn=<NegBackward>)


training loop:   4% |#                                          | ETA:  0:17:10

tensor(0.2443, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2442, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2442, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2441, device='cuda:0', grad_fn=<NegBackward>)


training loop:   4% |##                                         | ETA:  0:17:06

tensor(0.4164, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4163, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4161, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4160, device='cuda:0', grad_fn=<NegBackward>)


training loop:   5% |##                                         | ETA:  0:17:02

tensor(0.4546, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4544, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4543, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4542, device='cuda:0', grad_fn=<NegBackward>)


training loop:   5% |##                                         | ETA:  0:16:58

tensor(0.4743, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4741, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4739, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4737, device='cuda:0', grad_fn=<NegBackward>)


training loop:   5% |##                                         | ETA:  0:16:55

tensor(0.4486, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4485, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4484, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4482, device='cuda:0', grad_fn=<NegBackward>)


training loop:   6% |##                                         | ETA:  0:16:51

tensor(0.4322, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4321, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4320, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4318, device='cuda:0', grad_fn=<NegBackward>)


training loop:   6% |##                                         | ETA:  0:16:49

tensor(0.3243, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3242, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3241, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3240, device='cuda:0', grad_fn=<NegBackward>)


training loop:   6% |##                                         | ETA:  0:16:47

tensor(0.2827, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2826, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2825, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2824, device='cuda:0', grad_fn=<NegBackward>)
Episode: 20, score: 0.047000
[0.         0.         0.         0.         0.21       0.
 0.29999999 0.         0.         0.         0.         0.
 0.07       0.         0.         0.35999999 0.         0.
 0.         0.        ]


training loop:   7% |###                                        | ETA:  0:16:46

tensor(0.2725, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2725, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2724, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2723, device='cuda:0', grad_fn=<NegBackward>)


training loop:   7% |###                                        | ETA:  0:16:44

tensor(0.4320, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4319, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4318, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4316, device='cuda:0', grad_fn=<NegBackward>)


training loop:   7% |###                                        | ETA:  0:16:41

tensor(0.3185, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3184, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3184, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3183, device='cuda:0', grad_fn=<NegBackward>)


training loop:   8% |###                                        | ETA:  0:16:39

tensor(0.3206, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3206, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3205, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3204, device='cuda:0', grad_fn=<NegBackward>)


training loop:   8% |###                                        | ETA:  0:16:36

tensor(0.1849, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1848, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1848, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1847, device='cuda:0', grad_fn=<NegBackward>)


training loop:   8% |###                                        | ETA:  0:16:32

tensor(0.2885, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2885, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2884, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2883, device='cuda:0', grad_fn=<NegBackward>)


training loop:   9% |###                                        | ETA:  0:16:28

tensor(0.1953, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1953, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1952, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1952, device='cuda:0', grad_fn=<NegBackward>)


training loop:   9% |####                                       | ETA:  0:16:24

tensor(0.2889, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2888, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2887, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2886, device='cuda:0', grad_fn=<NegBackward>)


training loop:   9% |####                                       | ETA:  0:16:22

tensor(0.2364, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2364, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2364, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2363, device='cuda:0', grad_fn=<NegBackward>)


training loop:  10% |####                                       | ETA:  0:16:19

tensor(0.2564, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2563, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2563, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2562, device='cuda:0', grad_fn=<NegBackward>)
Episode: 30, score: 0.028500
[0.         0.         0.         0.         0.         0.41999999
 0.         0.         0.         0.         0.         0.
 0.         0.15       0.         0.         0.         0.
 0.         0.        ]


training loop:  10% |####                                       | ETA:  0:16:18

tensor(0.2030, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2030, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2029, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2028, device='cuda:0', grad_fn=<NegBackward>)


training loop:  10% |####                                       | ETA:  0:16:15

tensor(0.0107, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0107, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0107, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0107, device='cuda:0', grad_fn=<NegBackward>)


training loop:  11% |####                                       | ETA:  0:16:11

tensor(0.0210, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0210, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0210, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0210, device='cuda:0', grad_fn=<NegBackward>)


training loop:  11% |####                                       | ETA:  0:16:07

tensor(0.1802, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1802, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1801, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1800, device='cuda:0', grad_fn=<NegBackward>)


training loop:  11% |#####                                      | ETA:  0:16:03

tensor(0.2449, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2449, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2448, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2448, device='cuda:0', grad_fn=<NegBackward>)


training loop:  12% |#####                                      | ETA:  0:15:59

tensor(0.0691, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0691, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0691, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0690, device='cuda:0', grad_fn=<NegBackward>)


training loop:  12% |#####                                      | ETA:  0:15:54

tensor(0.3140, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3139, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3139, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3138, device='cuda:0', grad_fn=<NegBackward>)


training loop:  12% |#####                                      | ETA:  0:15:50

tensor(0.2063, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2063, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2063, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2062, device='cuda:0', grad_fn=<NegBackward>)


training loop:  13% |#####                                      | ETA:  0:15:47

tensor(0.2073, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2073, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2072, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2072, device='cuda:0', grad_fn=<NegBackward>)


training loop:  13% |#####                                      | ETA:  0:15:44

tensor(0.1322, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1322, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1322, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1321, device='cuda:0', grad_fn=<NegBackward>)
Episode: 40, score: 0.006000
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.12 0.   0.  ]


training loop:  13% |#####                                      | ETA:  0:15:40

tensor(0.2294, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2294, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2293, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2292, device='cuda:0', grad_fn=<NegBackward>)


training loop:  14% |######                                     | ETA:  0:15:36

tensor(0.4067, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4066, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4065, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4063, device='cuda:0', grad_fn=<NegBackward>)


training loop:  14% |######                                     | ETA:  0:15:31

tensor(0.3695, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3695, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3693, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3692, device='cuda:0', grad_fn=<NegBackward>)


training loop:  14% |######                                     | ETA:  0:15:27

tensor(0.3381, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3380, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3379, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3378, device='cuda:0', grad_fn=<NegBackward>)


training loop:  15% |######                                     | ETA:  0:15:23

tensor(0.3363, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3363, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3362, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3360, device='cuda:0', grad_fn=<NegBackward>)


training loop:  15% |######                                     | ETA:  0:15:19

tensor(0.1594, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1594, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1593, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1593, device='cuda:0', grad_fn=<NegBackward>)


training loop:  15% |######                                     | ETA:  0:15:15

tensor(0.3744, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3744, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3743, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3741, device='cuda:0', grad_fn=<NegBackward>)


training loop:  16% |######                                     | ETA:  0:15:11

tensor(0.2891, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2891, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2890, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2889, device='cuda:0', grad_fn=<NegBackward>)


training loop:  16% |#######                                    | ETA:  0:15:08

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  16% |#######                                    | ETA:  0:15:04

tensor(0.2675, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2675, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2674, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2673, device='cuda:0', grad_fn=<NegBackward>)
Episode: 50, score: 0.029500
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.34999999 0.
 0.         0.         0.         0.         0.23999999 0.
 0.         0.        ]


training loop:  17% |#######                                    | ETA:  0:15:00

tensor(0.3468, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3467, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3465, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3463, device='cuda:0', grad_fn=<NegBackward>)


training loop:  17% |#######                                    | ETA:  0:14:56

tensor(0.2288, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2288, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2286, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2286, device='cuda:0', grad_fn=<NegBackward>)


training loop:  17% |#######                                    | ETA:  0:14:52

tensor(0.2843, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2842, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2842, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2841, device='cuda:0', grad_fn=<NegBackward>)


training loop:  18% |#######                                    | ETA:  0:14:48

tensor(0.2432, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2432, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2431, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2430, device='cuda:0', grad_fn=<NegBackward>)


training loop:  18% |#######                                    | ETA:  0:14:45

tensor(0.3834, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3834, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3832, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3831, device='cuda:0', grad_fn=<NegBackward>)


training loop:  18% |########                                   | ETA:  0:14:41

tensor(0.2194, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2194, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2193, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2192, device='cuda:0', grad_fn=<NegBackward>)


training loop:  19% |########                                   | ETA:  0:14:37

tensor(0.3289, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3288, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3287, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3286, device='cuda:0', grad_fn=<NegBackward>)


training loop:  19% |########                                   | ETA:  0:14:33

tensor(0.2786, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2785, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2785, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2784, device='cuda:0', grad_fn=<NegBackward>)


training loop:  19% |########                                   | ETA:  0:14:29

tensor(0.1781, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1780, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1780, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1779, device='cuda:0', grad_fn=<NegBackward>)


training loop:  20% |########                                   | ETA:  0:14:26

tensor(0.3980, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3979, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3977, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3975, device='cuda:0', grad_fn=<NegBackward>)
Episode: 60, score: 0.029500
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.16 0.07 0.   0.   0.17
 0.   0.   0.   0.   0.19 0.  ]


training loop:  20% |########                                   | ETA:  0:14:22

tensor(0.0241, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0241, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0241, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0241, device='cuda:0', grad_fn=<NegBackward>)


training loop:  20% |########                                   | ETA:  0:14:19

tensor(0.5006, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.5005, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.5003, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.5001, device='cuda:0', grad_fn=<NegBackward>)


training loop:  21% |#########                                  | ETA:  0:14:15

tensor(0.2027, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2028, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2026, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2025, device='cuda:0', grad_fn=<NegBackward>)


training loop:  21% |#########                                  | ETA:  0:14:11

tensor(0.0301, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0301, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0301, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0301, device='cuda:0', grad_fn=<NegBackward>)


training loop:  21% |#########                                  | ETA:  0:14:08

tensor(0.4014, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4013, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4012, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4010, device='cuda:0', grad_fn=<NegBackward>)


training loop:  22% |#########                                  | ETA:  0:14:04

tensor(0.2294, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2293, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2292, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2292, device='cuda:0', grad_fn=<NegBackward>)


training loop:  22% |#########                                  | ETA:  0:14:00

tensor(0.4114, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4113, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4112, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4110, device='cuda:0', grad_fn=<NegBackward>)


training loop:  22% |#########                                  | ETA:  0:13:57

tensor(0.2423, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2423, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2422, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2421, device='cuda:0', grad_fn=<NegBackward>)


training loop:  23% |#########                                  | ETA:  0:13:53

tensor(0.3174, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3173, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3172, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3172, device='cuda:0', grad_fn=<NegBackward>)


training loop:  23% |##########                                 | ETA:  0:13:49

tensor(0.2232, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2232, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2231, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2230, device='cuda:0', grad_fn=<NegBackward>)
Episode: 70, score: 0.009500
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.19 0.   0.   0.  ]


training loop:  23% |##########                                 | ETA:  0:13:46

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  24% |##########                                 | ETA:  0:13:42

tensor(0.2707, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2707, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2706, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2705, device='cuda:0', grad_fn=<NegBackward>)


training loop:  24% |##########                                 | ETA:  0:13:39

tensor(0.3011, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3010, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3009, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3008, device='cuda:0', grad_fn=<NegBackward>)


training loop:  24% |##########                                 | ETA:  0:13:35

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  25% |##########                                 | ETA:  0:13:32

tensor(0.1868, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1867, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1867, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1866, device='cuda:0', grad_fn=<NegBackward>)


training loop:  25% |##########                                 | ETA:  0:13:28

tensor(0.1107, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1106, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1106, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1105, device='cuda:0', grad_fn=<NegBackward>)


training loop:  25% |###########                                | ETA:  0:13:25

tensor(0.3481, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3480, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3478, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3477, device='cuda:0', grad_fn=<NegBackward>)


training loop:  26% |###########                                | ETA:  0:13:21

tensor(0.0826, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0826, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0826, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0825, device='cuda:0', grad_fn=<NegBackward>)


training loop:  26% |###########                                | ETA:  0:13:17

tensor(0.2412, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2411, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2411, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2409, device='cuda:0', grad_fn=<NegBackward>)


training loop:  26% |###########                                | ETA:  0:13:14

tensor(0.2305, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2305, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2304, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2303, device='cuda:0', grad_fn=<NegBackward>)
Episode: 80, score: 0.006000
[0.   0.   0.   0.   0.   0.   0.   0.   0.12 0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]


training loop:  27% |###########                                | ETA:  0:13:10

tensor(0.2749, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2748, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2747, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2747, device='cuda:0', grad_fn=<NegBackward>)


training loop:  27% |###########                                | ETA:  0:13:07

tensor(0.3349, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3348, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3347, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3346, device='cuda:0', grad_fn=<NegBackward>)


training loop:  27% |###########                                | ETA:  0:13:03

tensor(0.1390, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1390, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1390, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1389, device='cuda:0', grad_fn=<NegBackward>)


training loop:  28% |############                               | ETA:  0:12:59

tensor(0.3756, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3756, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3754, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3753, device='cuda:0', grad_fn=<NegBackward>)


training loop:  28% |############                               | ETA:  0:12:56

tensor(0.4167, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4167, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4165, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4163, device='cuda:0', grad_fn=<NegBackward>)


training loop:  28% |############                               | ETA:  0:12:52

tensor(0.4024, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4023, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4022, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4020, device='cuda:0', grad_fn=<NegBackward>)


training loop:  29% |############                               | ETA:  0:12:49

tensor(0.1590, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1590, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1589, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1588, device='cuda:0', grad_fn=<NegBackward>)


training loop:  29% |############                               | ETA:  0:12:46

tensor(0.1554, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1554, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1553, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1553, device='cuda:0', grad_fn=<NegBackward>)


training loop:  29% |############                               | ETA:  0:12:44

tensor(0.3245, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3244, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3244, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3243, device='cuda:0', grad_fn=<NegBackward>)


training loop:  30% |############                               | ETA:  0:12:41

tensor(0.3742, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3742, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3740, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3739, device='cuda:0', grad_fn=<NegBackward>)
Episode: 90, score: 0.028500
[0.         0.         0.22999999 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.16       0.         0.         0.
 0.18       0.        ]


training loop:  30% |#############                              | ETA:  0:12:38

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  30% |#############                              | ETA:  0:12:36

tensor(0.0764, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0764, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0764, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0763, device='cuda:0', grad_fn=<NegBackward>)


training loop:  31% |#############                              | ETA:  0:12:32

tensor(0.3471, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3470, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3468, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3467, device='cuda:0', grad_fn=<NegBackward>)


training loop:  31% |#############                              | ETA:  0:12:28

tensor(0.2909, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2909, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2909, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2908, device='cuda:0', grad_fn=<NegBackward>)


training loop:  31% |#############                              | ETA:  0:12:25

tensor(0.1473, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1473, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1472, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1472, device='cuda:0', grad_fn=<NegBackward>)


training loop:  32% |#############                              | ETA:  0:12:21

tensor(0.4280, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4279, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4277, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4275, device='cuda:0', grad_fn=<NegBackward>)


training loop:  32% |#############                              | ETA:  0:12:18

tensor(0.3895, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3895, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3894, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3892, device='cuda:0', grad_fn=<NegBackward>)


training loop:  32% |##############                             | ETA:  0:12:14

tensor(0.3729, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3728, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3726, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3725, device='cuda:0', grad_fn=<NegBackward>)


training loop:  33% |##############                             | ETA:  0:12:11

tensor(0.2851, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2850, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2849, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2847, device='cuda:0', grad_fn=<NegBackward>)


training loop:  33% |##############                             | ETA:  0:12:08

tensor(0.1773, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1773, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1772, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1772, device='cuda:0', grad_fn=<NegBackward>)
Episode: 100, score: 0.010500
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.21 0.   0.   0.   0.   0.  ]


training loop:  33% |##############                             | ETA:  0:12:04

tensor(0.1858, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1858, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1857, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1856, device='cuda:0', grad_fn=<NegBackward>)


training loop:  34% |##############                             | ETA:  0:12:01

tensor(0.2590, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2589, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2588, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2587, device='cuda:0', grad_fn=<NegBackward>)


training loop:  34% |##############                             | ETA:  0:11:57

tensor(0.3347, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3347, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3346, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3345, device='cuda:0', grad_fn=<NegBackward>)


training loop:  34% |##############                             | ETA:  0:11:54

tensor(0.2043, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2042, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2042, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2041, device='cuda:0', grad_fn=<NegBackward>)


training loop:  35% |###############                            | ETA:  0:11:51

tensor(0.4129, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4128, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4126, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4125, device='cuda:0', grad_fn=<NegBackward>)


training loop:  35% |###############                            | ETA:  0:11:47

tensor(0.2386, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2386, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2386, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2385, device='cuda:0', grad_fn=<NegBackward>)


training loop:  35% |###############                            | ETA:  0:11:44

tensor(0.3848, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3847, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3846, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3845, device='cuda:0', grad_fn=<NegBackward>)


training loop:  36% |###############                            | ETA:  0:11:40

tensor(0.3170, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3169, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3167, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3166, device='cuda:0', grad_fn=<NegBackward>)


training loop:  36% |###############                            | ETA:  0:11:37

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  36% |###############                            | ETA:  0:11:33

tensor(0.2074, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2074, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2073, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2073, device='cuda:0', grad_fn=<NegBackward>)
Episode: 110, score: 0.004500
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.09]


training loop:  37% |###############                            | ETA:  0:11:30

tensor(0.3033, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3032, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3031, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3030, device='cuda:0', grad_fn=<NegBackward>)


training loop:  37% |################                           | ETA:  0:11:27

tensor(0.4619, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4618, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4616, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4614, device='cuda:0', grad_fn=<NegBackward>)


training loop:  37% |################                           | ETA:  0:11:23

tensor(0.1354, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1354, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1354, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1353, device='cuda:0', grad_fn=<NegBackward>)


training loop:  38% |################                           | ETA:  0:11:20

tensor(0.2713, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2712, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2711, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2711, device='cuda:0', grad_fn=<NegBackward>)


training loop:  38% |################                           | ETA:  0:11:16

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  38% |################                           | ETA:  0:11:13

tensor(0.2268, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2267, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2266, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2266, device='cuda:0', grad_fn=<NegBackward>)


training loop:  39% |################                           | ETA:  0:11:09

tensor(0.3286, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3286, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3285, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3284, device='cuda:0', grad_fn=<NegBackward>)


training loop:  39% |################                           | ETA:  0:11:06

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  39% |#################                          | ETA:  0:11:02

tensor(0.2373, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2373, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2372, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2371, device='cuda:0', grad_fn=<NegBackward>)


training loop:  40% |#################                          | ETA:  0:10:58

tensor(0.2842, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2842, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2841, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2840, device='cuda:0', grad_fn=<NegBackward>)
Episode: 120, score: 0.024500
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.48999999 0.        ]


training loop:  40% |#################                          | ETA:  0:10:55

tensor(0.4041, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4040, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4039, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4038, device='cuda:0', grad_fn=<NegBackward>)


training loop:  40% |#################                          | ETA:  0:10:51

tensor(0.1707, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1707, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1707, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1706, device='cuda:0', grad_fn=<NegBackward>)


training loop:  41% |#################                          | ETA:  0:10:47

tensor(0.1749, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1749, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1748, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1747, device='cuda:0', grad_fn=<NegBackward>)


training loop:  41% |#################                          | ETA:  0:10:44

tensor(0.0931, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0931, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0930, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0930, device='cuda:0', grad_fn=<NegBackward>)


training loop:  41% |#################                          | ETA:  0:10:40

tensor(0.1123, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1123, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1123, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1122, device='cuda:0', grad_fn=<NegBackward>)


training loop:  42% |##################                         | ETA:  0:10:36

tensor(0.0884, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0884, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0884, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0883, device='cuda:0', grad_fn=<NegBackward>)


training loop:  42% |##################                         | ETA:  0:10:32

tensor(0.1736, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1735, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1735, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1734, device='cuda:0', grad_fn=<NegBackward>)


training loop:  42% |##################                         | ETA:  0:10:29

tensor(0.2489, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2489, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2488, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2487, device='cuda:0', grad_fn=<NegBackward>)


training loop:  43% |##################                         | ETA:  0:10:25

tensor(0.1890, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1890, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1889, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1889, device='cuda:0', grad_fn=<NegBackward>)


training loop:  43% |##################                         | ETA:  0:10:22

tensor(0.0368, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0368, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0368, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0368, device='cuda:0', grad_fn=<NegBackward>)
Episode: 130, score: 0.005000
[0.  0.  0.  0.  0.1 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0. ]


training loop:  43% |##################                         | ETA:  0:10:18

tensor(0.1789, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1789, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1788, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1787, device='cuda:0', grad_fn=<NegBackward>)


training loop:  44% |##################                         | ETA:  0:10:14

tensor(0.1481, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1481, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1480, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1480, device='cuda:0', grad_fn=<NegBackward>)


training loop:  44% |###################                        | ETA:  0:10:11

tensor(0.3292, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3292, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3291, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3290, device='cuda:0', grad_fn=<NegBackward>)


training loop:  44% |###################                        | ETA:  0:10:08

tensor(0.3254, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3254, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3252, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3251, device='cuda:0', grad_fn=<NegBackward>)


training loop:  45% |###################                        | ETA:  0:10:04

tensor(0.2907, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2907, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2906, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2905, device='cuda:0', grad_fn=<NegBackward>)


training loop:  45% |###################                        | ETA:  0:10:01

tensor(0.3116, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3116, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3114, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3113, device='cuda:0', grad_fn=<NegBackward>)


training loop:  45% |###################                        | ETA:  0:09:57

tensor(0.0912, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0912, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0912, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0912, device='cuda:0', grad_fn=<NegBackward>)


training loop:  46% |###################                        | ETA:  0:09:54

tensor(0.1149, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1149, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1149, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1148, device='cuda:0', grad_fn=<NegBackward>)


training loop:  46% |###################                        | ETA:  0:09:50

tensor(0.0618, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0618, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0619, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0618, device='cuda:0', grad_fn=<NegBackward>)


training loop:  46% |####################                       | ETA:  0:09:47

tensor(0.1742, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1742, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1742, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1741, device='cuda:0', grad_fn=<NegBackward>)
Episode: 140, score: 0.016000
[0.   0.12 0.   0.   0.   0.   0.   0.2  0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]


training loop:  47% |####################                       | ETA:  0:09:43

tensor(0.0841, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0841, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0841, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0841, device='cuda:0', grad_fn=<NegBackward>)


training loop:  47% |####################                       | ETA:  0:09:39

tensor(0.3638, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3637, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3636, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3635, device='cuda:0', grad_fn=<NegBackward>)


training loop:  47% |####################                       | ETA:  0:09:36

tensor(0.4161, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4160, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4159, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4158, device='cuda:0', grad_fn=<NegBackward>)


training loop:  48% |####################                       | ETA:  0:09:32

tensor(0.3478, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3477, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3476, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3475, device='cuda:0', grad_fn=<NegBackward>)


training loop:  48% |####################                       | ETA:  0:09:28

tensor(0.1975, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1975, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1974, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1974, device='cuda:0', grad_fn=<NegBackward>)


training loop:  48% |####################                       | ETA:  0:09:24

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  49% |#####################                      | ETA:  0:09:21

tensor(0.0304, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0304, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0304, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0304, device='cuda:0', grad_fn=<NegBackward>)


training loop:  49% |#####################                      | ETA:  0:09:17

tensor(0.2039, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2039, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2038, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2038, device='cuda:0', grad_fn=<NegBackward>)


training loop:  49% |#####################                      | ETA:  0:09:13

tensor(0.2929, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2929, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2928, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2927, device='cuda:0', grad_fn=<NegBackward>)


training loop:  50% |#####################                      | ETA:  0:09:09

tensor(0.2855, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2854, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2853, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2852, device='cuda:0', grad_fn=<NegBackward>)
Episode: 150, score: 0.015500
[0.11 0.   0.   0.   0.   0.2  0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]


training loop:  50% |#####################                      | ETA:  0:09:06

tensor(0.3105, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3104, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3104, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3103, device='cuda:0', grad_fn=<NegBackward>)


training loop:  50% |#####################                      | ETA:  0:09:02

tensor(0.1397, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1397, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1397, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1396, device='cuda:0', grad_fn=<NegBackward>)


training loop:  51% |#####################                      | ETA:  0:08:58

tensor(0.1604, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1604, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1604, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1603, device='cuda:0', grad_fn=<NegBackward>)


training loop:  51% |######################                     | ETA:  0:08:54

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  51% |######################                     | ETA:  0:08:51

tensor(0.1963, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1962, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1962, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1961, device='cuda:0', grad_fn=<NegBackward>)


training loop:  52% |######################                     | ETA:  0:08:47

tensor(0.2689, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2689, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2688, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2688, device='cuda:0', grad_fn=<NegBackward>)


training loop:  52% |######################                     | ETA:  0:08:43

tensor(0.0741, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0741, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0741, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0741, device='cuda:0', grad_fn=<NegBackward>)


training loop:  52% |######################                     | ETA:  0:08:39

tensor(0.3159, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3159, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3158, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3157, device='cuda:0', grad_fn=<NegBackward>)


training loop:  53% |######################                     | ETA:  0:08:36

tensor(0.3212, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3211, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3210, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3209, device='cuda:0', grad_fn=<NegBackward>)


training loop:  53% |######################                     | ETA:  0:08:32

tensor(0.3588, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3587, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3585, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3584, device='cuda:0', grad_fn=<NegBackward>)
Episode: 160, score: 0.026500
[0.   0.   0.   0.   0.13 0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.21 0.   0.   0.   0.19 0.  ]


training loop:  53% |#######################                    | ETA:  0:08:28

tensor(0.3166, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3166, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3165, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3165, device='cuda:0', grad_fn=<NegBackward>)


training loop:  54% |#######################                    | ETA:  0:08:24

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  54% |#######################                    | ETA:  0:08:21

tensor(0.2169, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2168, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2168, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2167, device='cuda:0', grad_fn=<NegBackward>)


training loop:  54% |#######################                    | ETA:  0:08:17

tensor(0.2915, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2914, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2914, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2913, device='cuda:0', grad_fn=<NegBackward>)


training loop:  55% |#######################                    | ETA:  0:08:13

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  55% |#######################                    | ETA:  0:08:10

tensor(0.3192, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3192, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3191, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3190, device='cuda:0', grad_fn=<NegBackward>)


training loop:  55% |#######################                    | ETA:  0:08:06

tensor(0.0190, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0190, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0190, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0190, device='cuda:0', grad_fn=<NegBackward>)


training loop:  56% |########################                   | ETA:  0:08:02

tensor(0.1604, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1603, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1603, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1602, device='cuda:0', grad_fn=<NegBackward>)


training loop:  56% |########################                   | ETA:  0:07:58

tensor(0.0978, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0978, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0977, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0977, device='cuda:0', grad_fn=<NegBackward>)


training loop:  56% |########################                   | ETA:  0:07:55

tensor(0.1372, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1371, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1371, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1371, device='cuda:0', grad_fn=<NegBackward>)
Episode: 170, score: 0.020000
[0.22999999 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.17       0.        ]


training loop:  57% |########################                   | ETA:  0:07:51

tensor(0.2853, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2853, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2851, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2850, device='cuda:0', grad_fn=<NegBackward>)


training loop:  57% |########################                   | ETA:  0:07:47

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  57% |########################                   | ETA:  0:07:44

tensor(0.3022, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3022, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3021, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3020, device='cuda:0', grad_fn=<NegBackward>)


training loop:  58% |########################                   | ETA:  0:07:40

tensor(0.3996, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3997, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3996, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3992, device='cuda:0', grad_fn=<NegBackward>)


training loop:  58% |#########################                  | ETA:  0:07:36

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  58% |#########################                  | ETA:  0:07:32

tensor(0.1079, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1079, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1078, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1078, device='cuda:0', grad_fn=<NegBackward>)


training loop:  59% |#########################                  | ETA:  0:07:29

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  59% |#########################                  | ETA:  0:07:25

tensor(0.3391, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3390, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3388, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3387, device='cuda:0', grad_fn=<NegBackward>)


training loop:  59% |#########################                  | ETA:  0:07:21

tensor(0.1215, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1215, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1215, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1214, device='cuda:0', grad_fn=<NegBackward>)


training loop:  60% |#########################                  | ETA:  0:07:18

tensor(0.2518, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2518, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2517, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2516, device='cuda:0', grad_fn=<NegBackward>)
Episode: 180, score: 0.014500
[0.         0.         0.         0.         0.         0.
 0.         0.         0.28999999 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.        ]


training loop:  60% |#########################                  | ETA:  0:07:14

tensor(0.0373, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0373, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0373, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0373, device='cuda:0', grad_fn=<NegBackward>)


training loop:  60% |##########################                 | ETA:  0:07:10

tensor(0.1110, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1110, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1109, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1109, device='cuda:0', grad_fn=<NegBackward>)


training loop:  61% |##########################                 | ETA:  0:07:06

tensor(0.3012, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3012, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3011, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3009, device='cuda:0', grad_fn=<NegBackward>)


training loop:  61% |##########################                 | ETA:  0:07:03

tensor(0.1059, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1059, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1059, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1058, device='cuda:0', grad_fn=<NegBackward>)


training loop:  61% |##########################                 | ETA:  0:06:59

tensor(0.0408, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0407, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0407, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0407, device='cuda:0', grad_fn=<NegBackward>)


training loop:  62% |##########################                 | ETA:  0:06:55

tensor(0.1698, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1698, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1697, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1697, device='cuda:0', grad_fn=<NegBackward>)


training loop:  62% |##########################                 | ETA:  0:06:52

tensor(0.0622, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0622, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0621, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0621, device='cuda:0', grad_fn=<NegBackward>)


training loop:  62% |##########################                 | ETA:  0:06:48

tensor(0.3865, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3864, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3862, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3860, device='cuda:0', grad_fn=<NegBackward>)


training loop:  63% |###########################                | ETA:  0:06:44

tensor(0.4341, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4341, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4340, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4337, device='cuda:0', grad_fn=<NegBackward>)


training loop:  63% |###########################                | ETA:  0:06:41

tensor(0.2135, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2135, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2134, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2133, device='cuda:0', grad_fn=<NegBackward>)
Episode: 190, score: 0.030000
[0.         0.07       0.         0.         0.         0.
 0.         0.         0.35999999 0.04       0.         0.
 0.         0.         0.         0.         0.13       0.
 0.         0.        ]


training loop:  63% |###########################                | ETA:  0:06:37

tensor(0.3971, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3971, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3970, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3968, device='cuda:0', grad_fn=<NegBackward>)


training loop:  64% |###########################                | ETA:  0:06:33

tensor(0.2629, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2628, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2627, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2626, device='cuda:0', grad_fn=<NegBackward>)


training loop:  64% |###########################                | ETA:  0:06:30

tensor(0.2580, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2580, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2579, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2579, device='cuda:0', grad_fn=<NegBackward>)


training loop:  64% |###########################                | ETA:  0:06:26

tensor(0.4438, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4438, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4439, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4437, device='cuda:0', grad_fn=<NegBackward>)


training loop:  65% |###########################                | ETA:  0:06:22

tensor(0.2711, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2718, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2710, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2715, device='cuda:0', grad_fn=<NegBackward>)


training loop:  65% |############################               | ETA:  0:06:19

tensor(0.2963, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2966, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2961, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2963, device='cuda:0', grad_fn=<NegBackward>)


training loop:  65% |############################               | ETA:  0:06:15

tensor(0.3650, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3652, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3649, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3648, device='cuda:0', grad_fn=<NegBackward>)


training loop:  66% |############################               | ETA:  0:06:11

tensor(0.2584, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2586, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2583, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2584, device='cuda:0', grad_fn=<NegBackward>)


training loop:  66% |############################               | ETA:  0:06:08

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  66% |############################               | ETA:  0:06:04

tensor(0.3828, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3828, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3826, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3825, device='cuda:0', grad_fn=<NegBackward>)
Episode: 200, score: 0.027000
[0.         0.         0.         0.         0.         0.29999999
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.12       0.
 0.         0.12      ]


training loop:  67% |############################               | ETA:  0:06:00

tensor(0.2191, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2191, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2189, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2188, device='cuda:0', grad_fn=<NegBackward>)


training loop:  67% |############################               | ETA:  0:05:57

tensor(0.3851, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3851, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3849, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3848, device='cuda:0', grad_fn=<NegBackward>)


training loop:  67% |#############################              | ETA:  0:05:53

tensor(0.3942, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3941, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3940, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3938, device='cuda:0', grad_fn=<NegBackward>)


training loop:  68% |#############################              | ETA:  0:05:49

tensor(0.1510, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1509, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1509, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1508, device='cuda:0', grad_fn=<NegBackward>)


training loop:  68% |#############################              | ETA:  0:05:46

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  68% |#############################              | ETA:  0:05:42

tensor(0.3762, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3761, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3759, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3758, device='cuda:0', grad_fn=<NegBackward>)


training loop:  69% |#############################              | ETA:  0:05:38

tensor(0.1242, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1242, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1242, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1241, device='cuda:0', grad_fn=<NegBackward>)


training loop:  69% |#############################              | ETA:  0:05:35

tensor(0.2836, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2836, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2835, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2834, device='cuda:0', grad_fn=<NegBackward>)


training loop:  69% |#############################              | ETA:  0:05:31

tensor(0.4371, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4369, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4368, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4366, device='cuda:0', grad_fn=<NegBackward>)


training loop:  70% |##############################             | ETA:  0:05:27

tensor(0.2523, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2522, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2521, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2519, device='cuda:0', grad_fn=<NegBackward>)
Episode: 210, score: 0.021000
[0.13 0.   0.   0.   0.   0.   0.07 0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.22 0.   0.   0.  ]


training loop:  70% |##############################             | ETA:  0:05:24

tensor(0.4480, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4478, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4477, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4475, device='cuda:0', grad_fn=<NegBackward>)


training loop:  70% |##############################             | ETA:  0:05:20

tensor(0.0788, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0788, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0787, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0787, device='cuda:0', grad_fn=<NegBackward>)


training loop:  71% |##############################             | ETA:  0:05:16

tensor(0.1012, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1011, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1011, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1011, device='cuda:0', grad_fn=<NegBackward>)


training loop:  71% |##############################             | ETA:  0:05:13

tensor(0.3372, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3372, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3371, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3370, device='cuda:0', grad_fn=<NegBackward>)


training loop:  71% |##############################             | ETA:  0:05:09

tensor(0.3199, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3198, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3197, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3196, device='cuda:0', grad_fn=<NegBackward>)


training loop:  72% |##############################             | ETA:  0:05:05

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  72% |###############################            | ETA:  0:05:02

tensor(0.2995, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2995, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2994, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2993, device='cuda:0', grad_fn=<NegBackward>)


training loop:  72% |###############################            | ETA:  0:04:58

tensor(0.3617, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3617, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3615, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3614, device='cuda:0', grad_fn=<NegBackward>)


training loop:  73% |###############################            | ETA:  0:04:54

tensor(0.2549, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2549, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2548, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2548, device='cuda:0', grad_fn=<NegBackward>)


training loop:  73% |###############################            | ETA:  0:04:51

tensor(0.3910, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3909, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3908, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3907, device='cuda:0', grad_fn=<NegBackward>)
Episode: 220, score: 0.009000
[0.   0.   0.   0.   0.   0.02 0.   0.   0.1  0.   0.   0.   0.   0.
 0.06 0.   0.   0.   0.   0.  ]


training loop:  73% |###############################            | ETA:  0:04:47

tensor(0.3080, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3080, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3079, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3078, device='cuda:0', grad_fn=<NegBackward>)


training loop:  74% |###############################            | ETA:  0:04:43

tensor(0.2928, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2928, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2927, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2926, device='cuda:0', grad_fn=<NegBackward>)


training loop:  74% |###############################            | ETA:  0:04:40

tensor(0.1248, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1247, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1247, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1247, device='cuda:0', grad_fn=<NegBackward>)


training loop:  74% |################################           | ETA:  0:04:36

tensor(0.2666, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2666, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2664, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2663, device='cuda:0', grad_fn=<NegBackward>)


training loop:  75% |################################           | ETA:  0:04:32

tensor(0.1242, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1242, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1241, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1240, device='cuda:0', grad_fn=<NegBackward>)


training loop:  75% |################################           | ETA:  0:04:29

tensor(0.2377, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2376, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2376, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2375, device='cuda:0', grad_fn=<NegBackward>)


training loop:  75% |################################           | ETA:  0:04:25

tensor(0.4298, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4297, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4295, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4293, device='cuda:0', grad_fn=<NegBackward>)


training loop:  76% |################################           | ETA:  0:04:21

tensor(0.0542, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0542, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0542, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0542, device='cuda:0', grad_fn=<NegBackward>)


training loop:  76% |################################           | ETA:  0:04:18

tensor(0.2945, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2945, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2944, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2943, device='cuda:0', grad_fn=<NegBackward>)


training loop:  76% |################################           | ETA:  0:04:14

tensor(0.2303, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2303, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2302, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2301, device='cuda:0', grad_fn=<NegBackward>)
Episode: 230, score: 0.008500
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.17
 0.   0.   0.   0.   0.   0.  ]


training loop:  77% |#################################          | ETA:  0:04:10

tensor(0.2269, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2269, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2268, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2267, device='cuda:0', grad_fn=<NegBackward>)


training loop:  77% |#################################          | ETA:  0:04:07

tensor(0.3580, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3579, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3578, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3577, device='cuda:0', grad_fn=<NegBackward>)


training loop:  77% |#################################          | ETA:  0:04:03

tensor(0.2683, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2682, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2681, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2681, device='cuda:0', grad_fn=<NegBackward>)


training loop:  78% |#################################          | ETA:  0:03:59

tensor(0.2181, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2181, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2180, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2179, device='cuda:0', grad_fn=<NegBackward>)


training loop:  78% |#################################          | ETA:  0:03:56

tensor(0.2998, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2998, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2997, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2996, device='cuda:0', grad_fn=<NegBackward>)


training loop:  78% |#################################          | ETA:  0:03:52

tensor(0.3355, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3355, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3353, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3352, device='cuda:0', grad_fn=<NegBackward>)


training loop:  79% |#################################          | ETA:  0:03:48

tensor(0.3380, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3380, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3379, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3379, device='cuda:0', grad_fn=<NegBackward>)


training loop:  79% |##################################         | ETA:  0:03:45

tensor(0.2724, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2724, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2723, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2722, device='cuda:0', grad_fn=<NegBackward>)


training loop:  79% |##################################         | ETA:  0:03:41

tensor(0.1255, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1255, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1254, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1253, device='cuda:0', grad_fn=<NegBackward>)


training loop:  80% |##################################         | ETA:  0:03:38

tensor(0.2257, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2257, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2256, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2255, device='cuda:0', grad_fn=<NegBackward>)
Episode: 240, score: 0.016000
[0.         0.         0.         0.         0.         0.
 0.25999999 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.06      ]


training loop:  80% |##################################         | ETA:  0:03:34

tensor(0.1178, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1178, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1178, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1177, device='cuda:0', grad_fn=<NegBackward>)


training loop:  80% |##################################         | ETA:  0:03:30

tensor(0.2787, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2787, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2786, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2785, device='cuda:0', grad_fn=<NegBackward>)


training loop:  81% |##################################         | ETA:  0:03:27

tensor(0.2120, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2120, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2119, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2118, device='cuda:0', grad_fn=<NegBackward>)


training loop:  81% |##################################         | ETA:  0:03:23

tensor(0.0923, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0922, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0922, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0922, device='cuda:0', grad_fn=<NegBackward>)


training loop:  81% |###################################        | ETA:  0:03:19

tensor(0.3410, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3409, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3408, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3407, device='cuda:0', grad_fn=<NegBackward>)


training loop:  82% |###################################        | ETA:  0:03:16

tensor(0.3374, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3373, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3372, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3371, device='cuda:0', grad_fn=<NegBackward>)


training loop:  82% |###################################        | ETA:  0:03:12

tensor(0.2451, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2450, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2449, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2448, device='cuda:0', grad_fn=<NegBackward>)


training loop:  82% |###################################        | ETA:  0:03:08

tensor(0.0892, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0892, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0892, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0891, device='cuda:0', grad_fn=<NegBackward>)


training loop:  83% |###################################        | ETA:  0:03:05

tensor(0.4296, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4295, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4294, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4292, device='cuda:0', grad_fn=<NegBackward>)


training loop:  83% |###################################        | ETA:  0:03:01

tensor(0.1532, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1531, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1531, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1531, device='cuda:0', grad_fn=<NegBackward>)
Episode: 250, score: 0.006500
[0.   0.   0.13 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]


training loop:  83% |###################################        | ETA:  0:02:58

tensor(0.2875, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2874, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2873, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2872, device='cuda:0', grad_fn=<NegBackward>)


training loop:  84% |####################################       | ETA:  0:02:54

tensor(0.2442, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2441, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2440, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2440, device='cuda:0', grad_fn=<NegBackward>)


training loop:  84% |####################################       | ETA:  0:02:50

tensor(0.4803, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4801, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4799, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4796, device='cuda:0', grad_fn=<NegBackward>)


training loop:  84% |####################################       | ETA:  0:02:47

tensor(0.3907, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3906, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3905, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3904, device='cuda:0', grad_fn=<NegBackward>)


training loop:  85% |####################################       | ETA:  0:02:43

tensor(0.3272, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3272, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3270, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3269, device='cuda:0', grad_fn=<NegBackward>)


training loop:  85% |####################################       | ETA:  0:02:39

tensor(0.2848, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2847, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2846, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2845, device='cuda:0', grad_fn=<NegBackward>)


training loop:  85% |####################################       | ETA:  0:02:36

tensor(0.3651, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3650, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3648, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3647, device='cuda:0', grad_fn=<NegBackward>)


training loop:  86% |####################################       | ETA:  0:02:32

tensor(0.3506, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3505, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3504, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3502, device='cuda:0', grad_fn=<NegBackward>)


training loop:  86% |#####################################      | ETA:  0:02:28

tensor(0.3945, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3945, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3943, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3942, device='cuda:0', grad_fn=<NegBackward>)


training loop:  86% |#####################################      | ETA:  0:02:25

tensor(0.2574, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2575, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2573, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2573, device='cuda:0', grad_fn=<NegBackward>)
Episode: 260, score: 0.016000
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.18 0.   0.
 0.   0.   0.   0.14 0.   0.  ]


training loop:  87% |#####################################      | ETA:  0:02:21

tensor(0.2716, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2715, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2714, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2713, device='cuda:0', grad_fn=<NegBackward>)


training loop:  87% |#####################################      | ETA:  0:02:17

tensor(0.2590, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2589, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2588, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2587, device='cuda:0', grad_fn=<NegBackward>)


training loop:  87% |#####################################      | ETA:  0:02:14

tensor(0.2886, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2885, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2885, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2884, device='cuda:0', grad_fn=<NegBackward>)


training loop:  88% |#####################################      | ETA:  0:02:10

tensor(0.2692, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2692, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2691, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2690, device='cuda:0', grad_fn=<NegBackward>)


training loop:  88% |#####################################      | ETA:  0:02:07

tensor(0.2127, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2127, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2126, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2125, device='cuda:0', grad_fn=<NegBackward>)


training loop:  88% |######################################     | ETA:  0:02:03

tensor(0.2746, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2745, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2745, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2744, device='cuda:0', grad_fn=<NegBackward>)


training loop:  89% |######################################     | ETA:  0:01:59

tensor(0.2124, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2123, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2122, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2122, device='cuda:0', grad_fn=<NegBackward>)


training loop:  89% |######################################     | ETA:  0:01:56

tensor(0.1883, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1882, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1882, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1881, device='cuda:0', grad_fn=<NegBackward>)


training loop:  89% |######################################     | ETA:  0:01:52

tensor(0.3064, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3063, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3062, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3061, device='cuda:0', grad_fn=<NegBackward>)


training loop:  90% |######################################     | ETA:  0:01:48

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
Episode: 270, score: 0.000000
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


training loop:  90% |######################################     | ETA:  0:01:45

tensor(0.0901, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0901, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0901, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0900, device='cuda:0', grad_fn=<NegBackward>)


training loop:  90% |######################################     | ETA:  0:01:41

tensor(0.0343, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0342, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0342, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0342, device='cuda:0', grad_fn=<NegBackward>)


training loop:  91% |#######################################    | ETA:  0:01:38

tensor(0.3976, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3976, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3974, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3973, device='cuda:0', grad_fn=<NegBackward>)


training loop:  91% |#######################################    | ETA:  0:01:34

tensor(0.2669, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2668, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2667, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2666, device='cuda:0', grad_fn=<NegBackward>)


training loop:  91% |#######################################    | ETA:  0:01:30

tensor(0.3287, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3286, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3286, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3285, device='cuda:0', grad_fn=<NegBackward>)


training loop:  92% |#######################################    | ETA:  0:01:27

tensor(0.2036, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2036, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2035, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2034, device='cuda:0', grad_fn=<NegBackward>)


training loop:  92% |#######################################    | ETA:  0:01:23

tensor(0.2335, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2334, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2333, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2332, device='cuda:0', grad_fn=<NegBackward>)


training loop:  92% |#######################################    | ETA:  0:01:19

tensor(0.2860, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2859, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2859, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2858, device='cuda:0', grad_fn=<NegBackward>)


training loop:  93% |#######################################    | ETA:  0:01:16

tensor(0.1784, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1784, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1783, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1782, device='cuda:0', grad_fn=<NegBackward>)


training loop:  93% |########################################   | ETA:  0:01:12

tensor(0.2894, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2893, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2892, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2891, device='cuda:0', grad_fn=<NegBackward>)
Episode: 280, score: 0.078000
[0.         0.2        0.         0.26999999 0.11       0.
 0.         0.         0.         0.         0.         0.05
 0.         0.70999998 0.         0.         0.         0.22
 0.         0.        ]


training loop:  93% |########################################   | ETA:  0:01:08

tensor(0.3030, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3030, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3029, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3027, device='cuda:0', grad_fn=<NegBackward>)


training loop:  94% |########################################   | ETA:  0:01:05

tensor(0.2404, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2403, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2402, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2401, device='cuda:0', grad_fn=<NegBackward>)


training loop:  94% |########################################   | ETA:  0:01:01

tensor(0.3407, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3405, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3403, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3402, device='cuda:0', grad_fn=<NegBackward>)


training loop:  94% |########################################   | ETA:  0:00:58

tensor(0.3228, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3228, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3227, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3226, device='cuda:0', grad_fn=<NegBackward>)


training loop:  95% |########################################   | ETA:  0:00:54

tensor(0.4067, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4068, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4065, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4063, device='cuda:0', grad_fn=<NegBackward>)


training loop:  95% |########################################   | ETA:  0:00:50

tensor(0.2642, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2643, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2641, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2641, device='cuda:0', grad_fn=<NegBackward>)


training loop:  95% |#########################################  | ETA:  0:00:47

tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)
tensor(-0., device='cuda:0', grad_fn=<NegBackward>)


training loop:  96% |#########################################  | ETA:  0:00:43

tensor(0.3738, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3737, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3736, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3735, device='cuda:0', grad_fn=<NegBackward>)


training loop:  96% |#########################################  | ETA:  0:00:39

tensor(0.4689, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4688, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4686, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4684, device='cuda:0', grad_fn=<NegBackward>)


training loop:  96% |#########################################  | ETA:  0:00:36

tensor(0.3163, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3163, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3162, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3161, device='cuda:0', grad_fn=<NegBackward>)
Episode: 290, score: 0.024000
[0.         0.         0.         0.         0.         0.
 0.07       0.         0.         0.31999999 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.09      ]


training loop:  97% |#########################################  | ETA:  0:00:32

tensor(0.3361, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3361, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3359, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.3359, device='cuda:0', grad_fn=<NegBackward>)


training loop:  97% |#########################################  | ETA:  0:00:29

tensor(0.2939, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2938, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2937, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2937, device='cuda:0', grad_fn=<NegBackward>)


training loop:  97% |#########################################  | ETA:  0:00:25

tensor(0.2194, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2194, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2192, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2191, device='cuda:0', grad_fn=<NegBackward>)


training loop:  98% |########################################## | ETA:  0:00:21

tensor(0.1634, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1634, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1634, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1633, device='cuda:0', grad_fn=<NegBackward>)


training loop:  98% |########################################## | ETA:  0:00:18

tensor(0.2650, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2649, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2648, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.2648, device='cuda:0', grad_fn=<NegBackward>)


training loop:  98% |########################################## | ETA:  0:00:14

tensor(0.4261, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4259, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4257, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.4255, device='cuda:0', grad_fn=<NegBackward>)


training loop:  99% |########################################## | ETA:  0:00:10

tensor(0.1930, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1930, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1930, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1929, device='cuda:0', grad_fn=<NegBackward>)


training loop:  99% |########################################## | ETA:  0:00:07

tensor(0.1313, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1312, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1312, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1312, device='cuda:0', grad_fn=<NegBackward>)


training loop:  99% |########################################## | ETA:  0:00:03

tensor(0.1347, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1346, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1346, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.1346, device='cuda:0', grad_fn=<NegBackward>)


training loop: 100% |###########################################| Time: 0:18:09

tensor(0.0818, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0818, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0817, device='cuda:0', grad_fn=<NegBackward>)
tensor(0.0817, device='cuda:0', grad_fn=<NegBackward>)
Episode: 300, score: 0.011000
[0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.22 0.
 0.   0.   0.   0.   0.   0.  ]



