# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np
from collections import deque

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env = UnityEnvironment(file_name="./Tennis_Linux/Tennis.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

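The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation, and the brain summary printed above reports 3 stacked vector observations, so each agent's state vector should have 3 × 8 = 24 entries. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.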

Run the code cell below to print some information about the environment.

In [4]:
# # reset the environment
# env_info = env.reset(train_mode=True)[brain_name]

# # number of agents 
# num_agents = len(env_info.agents)
# print('Number of agents:', num_agents)

# # size of each action
# action_size = brain.vector_action_space_size
# print('Size of each action:', action_size)

# # examine the state space 
# states = env_info.vector_observations
# state_size = states.shape[1]
# print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
# print('The state for the first agent looks like:', states[0])

### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, you will see how the agents perform when they select actions at random at each time step.  A window should pop up that allows you to observe the agents.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!

In [5]:
# for i in range(1, 6):                                      # play game for 5 episodes
#     env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
#     states = env_info.vector_observations                  # get the current state (for each agent)
#     scores = np.zeros(num_agents)                          # initialize the score (for each agent)
#     while True:
#         actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#         actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#         env_info = env.step(actions)[brain_name]           # send all actions to the environment
#         next_states = env_info.vector_observations         # get next state (for each agent)
#         rewards = env_info.rewards                         # get reward (for each agent)
#         dones = env_info.local_done                        # see if episode finished
#         scores += env_info.rewards                         # update the score (for each agent)
#         states = next_states                               # roll over states to next time step
#         if np.any(dones):                                  # exit loop if episode finished
#             break
#     print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# from unityagents import UnityEnvironment
# import numpy as np

# env = UnityEnvironment(file_name="./Tennis_Linux/Tennis.x86_64", no_graphics=False)
# # get the default brain
# brain_name = env.brain_names[0]
# brain = env.brains[brain_name]

In [8]:
def env_extractor_state(env_info):
    states = env_info.vector_observations
    states = np.expand_dims(states, axis=0)
    return states


def env_extractor_reward(env_info):
    rewards = env_info.rewards
    rewards = np.array(rewards)
    rewards = np.expand_dims(rewards, axis=0)
    return rewards 


def env_extractor_done(env_info):
    dones = env_info.local_done
    dones = np.array(dones)
    dones = np.expand_dims(dones, axis=0)
    return dones

def env_extractor_global_state(env_info):
    agent_states = env_extractor_state(env_info)
    concat_states = agent_states.reshape(-1, )
    return np.expand_dims(concat_states, axis=0)
    

def evn_extractor(env_info):
    obs = env_extractor_state(env_info)        # get next state (for each agent)
    obs_full = env_extractor_global_state(env_info)
    rewards = env_extractor_reward(env_info)      # get reward (for each agent)
    dones = env_extractor_done(env_info)

    return obs, obs_full, rewards, dones


def actor_to_simulator(actions):
    actions_for_env = torch.stack(actions).detach().numpy().reshape(number_of_agents, -1)
    actions_for_env = np.clip(actions_for_env, -1, 1)
    return actions_for_env


def actor_to_buffer(actions):
    actions_for_buffer = np.rollaxis(torch.stack(actions).detach().numpy(), 1)
    actions_for_buffer = np.clip(actions_for_buffer, -1, 1)
    return actions_for_buffer
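
To make the shapes produced by the extractor helpers concrete, here is a NumPy-free sketch that mimics them on a stand-in `env_info`. The names `fake_env_info`, `with_env_axis`, and `shape` are hypothetical and exist only for this illustration, and the per-agent state length of 24 is an assumption read off the brain summary (8 variables, 3 stacked observations).

```python
from types import SimpleNamespace

NUM_AGENTS = 2
STATE_SIZE = 24  # assumption: 8 variables x 3 stacked observations per agent

# Hypothetical stand-in for the object returned by env.reset() / env.step()
fake_env_info = SimpleNamespace(
    vector_observations=[[float(a)] * STATE_SIZE for a in range(NUM_AGENTS)],
    rewards=[0.1, 0.0],
    local_done=[False, False],
)


def with_env_axis(x):
    """Mimic np.expand_dims(x, axis=0) by wrapping x in a length-1 outer list."""
    return [x]


def shape(x):
    """Return the nested-list dimensions, similar to ndarray.shape."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Per-agent observations: one row per agent
obs = with_env_axis(fake_env_info.vector_observations)
# Global state: all agents' observations concatenated into one flat row
obs_full = with_env_axis(
    [value for agent in fake_env_info.vector_observations for value in agent]
)
rewards = with_env_axis(list(fake_env_info.rewards))
dones = with_env_axis(list(fake_env_info.local_done))

print(shape(obs), shape(obs_full), shape(rewards), shape(dones))
```

The leading axis of length 1 appears to play the role of the parallel-environment axis (`parallel_envs = 1` in the training setup), which is why every helper adds it.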




In [9]:
from buffer import ReplayBuffer
from maddpg import MADDPG
import torch
import numpy as np
from tensorboardX import SummaryWriter
import os
from utilities import transpose_list, transpose_to_tensor
import imageio

def seeding(seed=1):
    np.random.seed(seed)
    torch.manual_seed(seed)


!sh ./clean.sh
seeding()
buffer = ReplayBuffer(int(1e6))
# number of training episodes.
# change this to a higher number to experiment, say 30000.
number_of_episodes = 2500
episode_length = 1000
batchsize = 256
# how many episodes between updates
episode_per_update = 1
parallel_envs = 1
number_of_agents = 2
action_size = 2
number_of_episode_before_training = 100
number_of_learning_per_episode = 3
# how often (in episodes) to save the policy and gif
save_interval = 100
# global step counter, advanced by parallel_envs on every environment step
t = 0

# amplitude of OU noise
# multiplied by noise_reduction on every step, so it slowly decays toward 0
noise = 1
noise_reduction = 0.99995


log_path = os.getcwd()+"/log"
model_dir= os.getcwd()+"/model_dir"

os.makedirs(model_dir, exist_ok=True)



maddpg = MADDPG(number_of_agents, action_size)

logger = SummaryWriter(log_dir=log_path)
agent0_reward = []
agent1_reward = []


scores_deque = deque(maxlen=100) 
avg_score = []
transition_bucket = list()
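
Because `noise` is multiplied by `noise_reduction` on every environment step rather than every episode, it can be useful to see how quickly the amplitude shrinks. The sketch below is only an illustration of that geometric decay; the helper names `noise_after` and `steps_until` are hypothetical and not part of the project code.

```python
import math

noise_reduction = 0.99995  # per-step decay factor used in the notebook


def noise_after(steps, initial=1.0, decay=noise_reduction):
    """Noise amplitude after `steps` environment steps of geometric decay."""
    return initial * decay ** steps


def steps_until(target, initial=1.0, decay=noise_reduction):
    """Smallest whole number of steps after which the amplitude is <= target."""
    return math.ceil(math.log(target / initial) / math.log(decay))


for steps in (15, 1_000, 10_000, 100_000):
    print(steps, round(noise_after(steps), 4))

print("steps to halve the noise:", steps_until(0.5))
```

With 15 steps the amplitude is about 0.99925, which agrees with the `noise_factor` printed for episode 0 in the training log further down, suggesting that the first episode lasted 15 steps.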

In [None]:
for episode in range(0, number_of_episodes):
    reward_this_episode = np.zeros((parallel_envs, number_of_agents))
    env_info = env.reset(train_mode=True)[brain_name] 
    obs, obs_full, rewards, dones = evn_extractor(env_info)
#     print("jupyter first environment next obs", obs.shape, obs)
#     print("jupyter first environment next_obs_full", obs_full.shape , obs_full)
#     print("jupyter first environment reward", rewards.shape, rewards)
#     print("jupyter first environment done", dones.shape, dones)
    
    save_info = (episode % save_interval == 0)
#     print("jupyter save info signal", save_info)

    for episode_t in range(episode_length):
        # the episode finishes before the buffer is sampled for training
        # t advances by the number of parallel environments on each step
        t += parallel_envs
#         print("jupyter obs", obs.shape, obs)
        actions = maddpg.act(transpose_to_tensor(obs), noise=noise)
#         print("jupyter actions", actions)
        noise *= noise_reduction
#         print("jupyter noise", noise)
        actions_for_env = actor_to_simulator(actions)
#         print("jupyter action for env", actions_for_env)
        actions_for_buffer = actor_to_buffer(actions)
#         print("jupyter actions_for_buffer", actions_for_buffer)        
        env_info = env.step(actions_for_env)[brain_name] 

        next_obs, next_obs_full, rewards, dones = evn_extractor(env_info)
#         print("jupyter environment next obs", next_obs.shape, next_obs)
#         print("jupyter environment next_obs_full", next_obs_full.shape, next_obs_full)
#         print("jupyter environment reward", rewards.shape, rewards)
#         print("jupyter environment done", dones.shape, dones)


        transition = (obs, obs_full, actions_for_buffer, rewards, next_obs, next_obs_full, dones)
        buffer.push(transition)
#         transition_bucket.append(transition)

        reward_this_episode += rewards
#         print("jupyter reward this episode", reward_this_episode )
        
        if np.any(dones) or (episode_t + 1 == episode_length):
#             if np.max(reward_this_episode) >= -10:
#                 [buffer.push(transition) for transition in transition_bucket]
#             elif np.random.rand(1) > 0.5:
#                 [buffer.push(transition) for transition in transition_bucket]
            break

        obs, obs_full = next_obs, next_obs_full


    # update every episode_per_update episodes, once the warm-up episodes are over
    if len(buffer) >= batchsize and (episode % episode_per_update == 0) and (episode > number_of_episode_before_training):
        for _ in range(number_of_learning_per_episode): # learn multiple times per episode
            for a_i in range(number_of_agents):
                samples = buffer.sample(batchsize)
    #             print("jupyter samples obs", samples[0])
    #             print("jupyter samples obs_full", samples[1])
    #             print("jupyter samples actions_for_buffer", samples[2])
    #             print("jupyter samples rewards", samples[3])
    #             print("jupyter samples next_obs", samples[4])
    #             print("jupyter samples dones", samples[5])
                maddpg.update(samples, a_i, logger)
    #             print("done with this line")
            maddpg.update_targets() #soft update the target network towards the actual networks

    for i in range(parallel_envs):
        agent0_reward.append(reward_this_episode[i, 0])
        agent1_reward.append(reward_this_episode[i, 1])

    max_reward_episode = np.max(reward_this_episode)
#     print("jupyter max reward", max_reward_episode)
    scores_deque.append(max_reward_episode)
    if episode % 50 == 0:
        print('\rEpisode {} \t Episode Max Reward {:.3f} \t  Average Trailing Max Score: {:.3f} noise_factor {}'\
              .format(episode, max_reward_episode, np.mean(scores_deque), str(noise)))

    
#     if (episode % 100 == 0) or (episode == number_of_episodes):
#         avg_rewards = [np.mean(agent0_reward), np.mean(agent1_reward)]
#         agent0_reward = []
#         agent1_reward = []
#         for a_i, avg_rew in enumerate(avg_rewards):
#             logger.add_scalar('agent%i/mean_episode_rewards' % a_i, avg_rew, episode)

#     #saving model
#     save_dict_list =[]
#     if save_info:
#         print("saving info episode {}, noise_factor {}".format(str(episode), str(noise)))
#         for i in range(number_of_agents):

#             save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
#                          'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
#                          'critic_params': maddpg.maddpg_agent[i].critic.state_dict(),
#                          'critic_optim_params': maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
#             save_dict_list.append(save_dict)

#             torch.save(save_dict_list, 
#                        os.path.join(model_dir, 'episode-{}.pt'.format(episode)))


# env.close()
logger.close()

print("done")


Episode 0 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.000 noise_factor 0.9992502624431336
Episode 50 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.004 noise_factor 0.9618940483017872
Episode 100 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.008 noise_factor 0.9221457143508329
Episode 150 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.007 noise_factor 0.8887378677711368
Episode 200 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.002 noise_factor 0.8570972795250429
Episode 250 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.001 noise_factor 0.8272033357485484
Episode 300 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.000 noise_factor 0.798352036600452
Episode 350 	 Episode Max Reward 0.000 	  Average Trailing Max Score: 0.000 noise_factor 0.7705070165938451
Episode 400 	 Episode Max Reward 0.100 	  Average Trailing Max Score: 0.002 noise_factor 0.7422586946485974
Episode 450 	 Episode Max Reward
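
The `Average Trailing Max Score` column in the log above is the mean of `scores_deque`, which keeps the per-episode maximum over both agents for at most the last 100 episodes. A minimal standalone sketch of that bookkeeping follows; the class name `TrailingScore` is hypothetical, and a window of 3 is used only so the roll-over is easy to see.

```python
from collections import deque


class TrailingScore:
    """Track the mean of the last `window` per-episode scores."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)  # old scores fall off automatically

    def add_episode(self, per_agent_returns):
        """Record one episode and return the current trailing mean."""
        self.scores.append(max(per_agent_returns))  # episode score = best agent
        return sum(self.scores) / len(self.scores)


tracker = TrailingScore(window=3)  # tiny window so the roll-over is visible
for returns in [(0.0, 0.1), (0.2, 0.0), (0.0, 0.0), (0.3, 0.1)]:
    print(round(tracker.add_episode(returns), 3))
```

Until the window fills, the mean is taken over fewer episodes, which is why the first few logged averages can move quickly.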

In [None]:
for d in range(10):
    samples = buffer.sample(batchsize)
    for i in samples[3]:
        if np.any(i>0.01) :
            print(i)


In [None]:
for i in range(1, 2):                                      # play game for 1 episode
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
    obs, obs_full, rewards, dones = evn_extractor(env_info)# get the current state (for each agent)
    scores = np.zeros(number_of_agents)                    # initialize the score (for each agent)
    while True:
        actions = maddpg.act(transpose_to_tensor(obs), noise=noise)  # note: still adds noise at its current scale
        actions_for_env = torch.stack(actions).detach().numpy().reshape(number_of_agents, -1)
        actions_for_env = np.clip(actions_for_env, -1, 1)  # all actions between -1 and 1
        env_info = env.step(actions_for_env)[brain_name]   # send all actions to the environment
        obs, obs_full, rewards, dones = evn_extractor(env_info)   # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        print(dones, rewards)
        scores += env_info.rewards                         # update the score (for each agent)
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

In [None]:
from OUNoise import OUNoise
import numpy as np
import matplotlib.pyplot as plt
n = OUNoise(2, scale=0.1, mu=0, theta=0.15, sigma=0.5)
x = list()
y = list()
for i in range(5000):
    d = n.noise()
    x.append(d[0])
    y.append(d[1])


plt.scatter(np.array(x), np.array(y))
plt.show()
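
The scatter plot above gives a feel for the range of the exploration noise, but not for its temporal correlation. The `OUNoise` module itself is not shown, so the sketch below uses a common discrete Ornstein-Uhlenbeck update as an assumption; `SimpleOUNoise` and `lag1_autocorrelation` are hypothetical names, and the project's class may differ in its details.

```python
import random


class SimpleOUNoise:
    """Hypothetical Ornstein-Uhlenbeck noise source; not the project's OUNoise class."""

    def __init__(self, size, scale=0.1, mu=0.0, theta=0.15, sigma=0.5, seed=None):
        self.scale, self.mu, self.theta, self.sigma = scale, mu, theta, sigma
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def reset(self):
        """Return the internal state to the mean."""
        self.state = [self.mu] * len(self.state)

    def noise(self):
        """Advance the process one step and return the scaled state."""
        self.state = [
            # pull toward mu with strength theta, then add a Gaussian kick
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return [self.scale * x for x in self.state]


def lag1_autocorrelation(values):
    """Sample correlation between consecutive values of a sequence."""
    mean = sum(values) / len(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


process = SimpleOUNoise(2, scale=0.1, mu=0, theta=0.15, sigma=0.5, seed=0)
first_dim = [process.noise()[0] for _ in range(5000)]
print("lag-1 autocorrelation:", round(lag1_autocorrelation(first_dim), 2))
```

Under this update rule consecutive samples are strongly correlated (roughly `1 - theta`), which is the property that distinguishes OU noise from independent Gaussian noise when exploring with continuous actions.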