# Report

---

In this notebook, I describe my implementation of the third project, **Collaboration and Competition**, of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.


### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [None]:
from unityagents import UnityEnvironment
import numpy as np

Next, we start the environment. Before running the code cell below, make sure the Unity environment that you downloaded is located in the `env/` folder, as described in the `Readme.md` file.

In [None]:
env = UnityEnvironment(file_name="env/Tennis.app")

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
print("Default brain is:", brain_name)

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

Run the code cell below to print some information about the environment.

In [None]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('\nThere are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('\nThe state for the first agent looks like:', states[0])
print('\nThe state for the second agent looks like:', states[1])

### 3. Define the DDPG Agent(s)

The agent that learns to control a racket is implemented in [ddpg_agent.py](./ddpg_agent.py).

The agent uses the DDPG learning algorithm. It can be initialized either for learning or for play only, based on a saved checkpoint (see below).

The DDPG (Deep Deterministic Policy Gradient) actor-critic approach is recommended for this problem, mainly because the agent must learn actions in a continuous space.
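
For reference, the textbook DDPG updates can be written as follows. This is a sketch of the standard formulation, not a transcription of `ddpg_agent.py`: the discount factor $\gamma$, the done flag $d$, and the target networks $Q'$ and $\mu'$ are assumptions taken from the general algorithm.

$$
y = r + \gamma \,(1 - d)\, Q'\big(s', \mu'(s')\big)
$$

$$
L(\theta^{Q}) = \mathbb{E}\Big[\big(Q(s, a) - y\big)^{2}\Big],
\qquad
\nabla_{\theta^{\mu}} J \approx \mathbb{E}\Big[\nabla_{a} Q(s, a)\big|_{a = \mu(s)} \, \nabla_{\theta^{\mu}} \mu(s)\Big]
$$

The critic $Q$ is regressed toward the bootstrapped target $y$, and the actor $\mu$ is moved in the direction that increases the critic's estimate of its own action.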

The actor and critic neural networks are implemented in [ddpg_model.py](./ddpg_model.py). Each model has two hidden layers. The critic takes as input a tensor with one element per state variable; the action is fed in at the second layer, and the output is the estimated Q-value of the given state-action pair. The actor takes a tensor with one element per state variable and outputs the action.
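
The architecture described above might look roughly like the PyTorch sketch below. The layer widths (`fc1_units`, `fc2_units`), the ReLU activations, and the `tanh` output of the actor are assumptions made for illustration; the actual definitions are in `ddpg_model.py` and may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state to a deterministic action (assumed to lie in [-1, 1])."""

    def __init__(self, state_size, action_size, fc1_units=128, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.out = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))  # assumption: bounded actions


class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size, action_size, fc1_units=128, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        # The action joins the network at the second layer.
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.out = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat((x, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.out(x)


# Shape check on a dummy batch of 4 states (sizes are illustrative).
actor = Actor(state_size=8, action_size=2)
critic = Critic(state_size=8, action_size=2)
state = torch.zeros(4, 8)
action = actor(state)
q_value = critic(state, action)
print(action.shape, q_value.shape)
```

Feeding the action in after the first layer lets the critic build state features first and only then combine them with the action.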

DDPG is an off-policy algorithm. Combined with function approximation and bootstrapping, this can make learning unstable. As in DQN, a replay buffer is used to stabilize learning.
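
A replay buffer of the kind mentioned above can be sketched with the standard library alone. The class name, field names, and sizes here are hypothetical; the buffer actually used by the agent is defined in `ddpg_agent.py`.

```python
import random
from collections import deque, namedtuple

# One stored transition (field names are illustrative).
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled mini-batches."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are evicted
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
batch = buffer.sample()
print(len(buffer), [e.state for e in batch])
```

Because the buffer holds transitions generated by older versions of the policy, it only makes sense for an off-policy method such as DDPG.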

Exploration is ensured by adding noise to the action returned by the actor network.
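
The report does not say which noise process is used. DDPG implementations commonly use an Ornstein-Uhlenbeck process, so the sketch below assumes one; the parameters `theta` and `sigma`, the clipping range, and the helper `noisy_action` are all hypothetical.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean, e.g. at the start of an episode.
        self.state = list(self.mu)

    def sample(self):
        # Pull each component toward the mean, then add a Gaussian kick.
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action, noise, low=-1.0, high=1.0):
    """Add exploration noise and clip to the assumed action bounds."""
    return [min(high, max(low, a + n)) for a, n in zip(action, noise.sample())]


noise = OUNoise(size=2)
action = noisy_action([0.9, -0.9], noise)
print(action)
```

Resetting the process at the start of each episode matches the `agent.reset()` call in the training loop, although what that method does internally is not shown in this notebook.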

### 4. Instantiate the learning Agent defined above

Import the `Agent` class and the packages needed for training; the agents themselves are instantiated in the next section. If checkpoint files exist, the NN weights can be loaded from them.

In [None]:
from collections import deque
from pathlib import Path
import random

import matplotlib.pyplot as plt
import numpy as np
import torch

from ddpg_agent import Agent
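
Loading weights only when a checkpoint exists could be done with a small helper like the one below. The function `load_if_exists` is hypothetical and not part of `ddpg_agent.py`; it is demonstrated on a plain `torch.nn.Linear` layer, but the same call should apply to `agent.actor_local` and `agent.critic_local` with the `.pth` files written by the training loop.

```python
import tempfile
from pathlib import Path

import torch


def load_if_exists(module, path):
    """Load a saved state_dict into `module` if `path` exists.

    Returns True when weights were loaded, False when the file is missing.
    """
    path = Path(path)
    if not path.exists():
        return False
    module.load_state_dict(torch.load(path, map_location="cpu"))
    return True


# Round trip with a stand-in layer and a temporary checkpoint file.
layer = torch.nn.Linear(3, 2)
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "demo.pth"
    torch.save(layer.state_dict(), checkpoint)

    restored = torch.nn.Linear(3, 2)
    loaded = load_if_exists(restored, checkpoint)
    missing = load_if_exists(restored, Path(tmp) / "absent.pth")

print(loaded, missing)
```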

### 5. Train the Agent with DDPG Algorithm

Run the code cell below to train the agents (from scratch, or continuing from models loaded from files). The NN parameters are saved to files every 100 episodes so that training can be resumed later. Training stops when the average score over the past 100 episodes reaches 0.5.

In [None]:
def ddpg(environment, agents, weights_actors, weights_critics, n_episodes=2000):
    brain_name = environment.brain_names[0]
    environment_info = environment.reset(train_mode=True)[brain_name]

    agents_size = len(agents)  # number of agents
    states_size = agents_size * environment_info.vector_observations.shape[1]  # size of the state space shared by the agents

    scores = []  # scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores

    for i_episode in range(1, n_episodes + 1):
        environment_info = environment.reset(train_mode=True)[brain_name]
        states = environment_info.vector_observations.reshape((1, states_size))

        # Reset the agents
        for agent in agents:
            agent.reset()

        scores_agents = np.zeros(agents_size)

        while True:

            # Perform actions in the environment
            actions = [agent.act(states, True) for agent in agents]  # execute actions with added noise
            actions = np.hstack(tuple(actions))  # stack the actions performed by the agents

            environment_info = environment.step(actions)[brain_name]  # send both agents' actions together to the environment
            next_states = environment_info.vector_observations.reshape((1, states_size))

            rewards = environment_info.rewards  # get reward
            dones = environment_info.local_done  # verify if episode finished

            for i, agent in enumerate(agents):
                agent.step(states, actions, rewards[i], next_states, dones[i], i)  # agent i learns

            scores_agents += rewards  # update the score for each agent
            states = next_states  # roll over states to next time step

            if np.any(dones):
                break
            
        scores_window.append(np.max(scores_agents))
        scores.append(np.max(scores_agents))
        
        #print('\rEpisode {}\tAverage Score: {:.4f}\tScore: {:.4f}'.format(i_episode, np.mean(scores_window), np.max(scores_agents), end=""))
        
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.4f}'.format(i_episode, np.mean(scores_window)))
            
            for agent, weights_actor, weights_critic in zip(agents, weights_actors, weights_critics):
                torch.save(agent.actor_local.state_dict(), weights_actor)
                torch.save(agent.critic_local.state_dict(), weights_critic)
        
        if np.mean(scores_window) >= 0.5:  # stop once the average score over the last 100 episodes reaches 0.5
            print('Environment solved in {:d} episodes!\tAverage Score: {:.4f}'.format(i_episode-100, np.mean(scores_window)))
            break

    return scores
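
The training loop above flattens the per-agent observations into one joint row and stacks the per-agent actions side by side before stepping the environment. The sketch below reproduces only those shape manipulations with made-up sizes and values; it assumes each `agent.act` call returns an array of shape `(1, action_size)`, which is not confirmed by the code shown in this notebook.

```python
import numpy as np

# Illustrative sizes only; the real values come from the environment.
num_agents, obs_size, action_size = 2, 8, 2

# One observation row per agent, as in env_info.vector_observations.
observations = np.arange(num_agents * obs_size, dtype=float).reshape(
    num_agents, obs_size
)

# Same reshape as in the loop: a single row holding both agents' observations.
joint_state = observations.reshape((1, num_agents * obs_size))

# Stand-ins for the per-agent outputs of agent.act(...).
per_agent_actions = [
    np.full((1, action_size), 0.1 * (i + 1)) for i in range(num_agents)
]

# Same stacking as in the loop: actions placed side by side in one row.
joint_actions = np.hstack(tuple(per_agent_actions))

print(joint_state.shape, joint_actions.shape)
```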

In [None]:

agents = [
    Agent(state_size=state_size, action_size=action_size, random_seed=0),
    Agent(state_size=state_size, action_size=action_size, random_seed=0),
]

# Checkpoint files for the actor weights
weights_actors = [
    "actor1.pth",
    "actor2.pth",
]

# Checkpoint files for the critic weights
weights_critics = [
    "critic1.pth",
    "critic2.pth",
]

scores_list = ddpg(environment=env, agents=agents, weights_actors=weights_actors, weights_critics=weights_critics)

In [None]:
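%matplotlib inline
import pandas as pd  # needed for the rolling mean below

episodes = np.arange(1, len(scores_list) + 1)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(episodes, scores_list)
ax.set_ylabel('Score')
ax.set_xlabel('Episode #')

# 100-episode rolling mean of the scores returned by ddpg()
window_mean = pd.Series(scores_list, index=episodes).rolling(100).mean()
ax.plot(window_mean, linewidth=4)
plt.show()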
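
The stopping rule used during training (episode score taken as the maximum over both agents, averaged over the last 100 episodes, compared against 0.5) can be isolated in a few lines. The function `first_solved_episode` is a hypothetical helper written for illustration; like the training loop, it also averages over fewer than 100 scores while the window is still filling.

```python
from collections import deque


def first_solved_episode(episode_scores, window=100, target=0.5):
    """Return the 1-based episode at which the rolling mean reaches `target`.

    Returns None if the target is never reached.
    """
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if sum(recent) / len(recent) >= target:
            return episode
    return None


# Episode score = best of the two agents' summed rewards (made-up numbers).
per_agent_totals = [0.29, 0.40]
episode_score = max(per_agent_totals)

# Synthetic history: 150 episodes scoring 0.0, then 100 episodes scoring 1.0.
history = [0.0] * 150 + [1.0] * 100
solved_at = first_solved_episode(history)
never = first_solved_episode([0.0] * 10)
early = first_solved_episode([1.0])  # window not yet full, still triggers

print(episode_score, solved_at, never, early)
```

The last case shows a side effect of this rule: a single lucky early episode can end training before 100 episodes have been played.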

When finished, you can close the environment.

In [None]:
env.close()

### Ideas for future improvements

There are several improvements that can be made:
   - tune the hyper-parameters to train even faster