### Project 3: Tennis environment
Author: Md. Masud Rana

#### Unity ML-Agents [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) Environment
In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single score for each episode.
- The environment is considered solved when the average (over 100 episodes) of those scores is at least +0.5.
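
The scoring rule above can be sketched in a few lines of plain Python. This is only an illustration: the function names `episode_score` and `is_solved` are hypothetical and do not come from the project code.

```python
from collections import deque


def episode_score(rewards_per_step):
    """Sum each agent's undiscounted rewards and return the larger total.

    rewards_per_step: list of (reward_agent_0, reward_agent_1) tuples.
    """
    totals = [sum(agent_rewards) for agent_rewards in zip(*rewards_per_step)]
    return max(totals)


def is_solved(episode_scores, window=100, target=0.5):
    """True once the mean of the last `window` episode scores reaches `target`."""
    recent = deque(episode_scores, maxlen=window)  # keeps only the last `window` scores
    return len(recent) == window and sum(recent) / window >= target
```

For example, if agent 0 hits the ball over the net twice while agent 1 lets it drop once, `episode_score([(0.1, 0.0), (0.0, -0.01), (0.1, 0.0)])` takes the larger of the two totals, which is agent 0's.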

<img src="images/tennis.png">


### Implementation
The MADDPG algorithm is an actor-critic method that also resembles
the DQN approach to reinforcement learning. The agent is composed of two neural
networks (NNs), the Actor and the Critic. Each has a target and a local copy, for a total of 4
NNs. The Actor encodes the policy function and the Critic estimates the action value function.

The learning pipeline first passes a state through the Actor network, which outputs
the best action it currently estimates for that state. This lets DDPG handle
continuous action spaces, in contrast to the regular DQN approach. The action is then fed
to the Critic network together with the state, and the Critic outputs an action value
(q). This q value drives the updates of both the Actor and the Critic networks, which reduces
the variance and instability of classic RL algorithms. The Critic is trained to match a
temporal-difference target computed with the target networks, the Actor is updated by
gradient ascent on the Critic's estimate, and the target networks' parameters are moved
slowly toward the local networks' parameters.
The behaviour of the agent can be explored in the maddpg_agent.py file.
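
As a rough numeric sketch of two pieces of this pipeline, the snippet below computes a one-step temporal-difference target for the Critic and a soft update of target parameters. Plain floats and lists stand in for network outputs and weights, and the function names `td_target` and `soft_update` are assumptions made for illustration. The code in maddpg_agent.py operates on full networks rather than scalars.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step TD target: r + gamma * Q_target(s', a'), with no bootstrap on terminal steps."""
    return reward + gamma * next_q * (1.0 - float(done))


def soft_update(local_params, target_params, tau=2e-3):
    """Blend local weights into target weights: target <- tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]
```

With a small `tau`, each call moves the target parameters only a fraction of the way toward the local ones, which is what keeps the Critic's training targets from shifting too quickly.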

Important libraries and components are imported and the following parameters are initialized:

- BUFFER_SIZE: the size of the replay buffer shared by the agents. The buffer holds tuples called experiences, composed of state, actions, rewards, next states and dones, which carry the information needed for learning.
- BATCH_SIZE: the number of experiences sampled for each update. The learning method is called once the number of experiences in the replay buffer exceeds the batch size.
- TAU: controls the model's soft updates, a method for slowly changing the target networks' parameters, which improves stability.
- LR_ACTOR and LR_CRITIC: the optimizer learning rates, which control the size of each gradient step.
- WEIGHT_DECAY: the L2 regularization parameter of the optimizer.

The main implementation begins in the fourth step: additional libraries and
components are imported, and an agent is created and initialized with the proper parameters,
state_size and action_size.
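
A shared replay buffer of this kind can be sketched with the standard library alone. The class below is a minimal illustration, not the author's implementation: the `ReplayBuffer` and `Experience` names, the `ready` helper and the `seed` argument are assumptions, and the defaults mirror the buffer and batch sizes used in this project.

```python
import random
from collections import deque, namedtuple

# One stored transition; the field names follow the description in the text.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer of experience tuples shared by both agents."""

    def __init__(self, buffer_size=int(1e5), batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def ready(self):
        """Learning is triggered only once the buffer holds more than one batch."""
        return len(self.memory) > self.batch_size

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly from a large buffer breaks the correlation between consecutive timesteps, which is one of the ideas DDPG borrows from DQN.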

The maddpg function is created, taking as parameters the
number of episodes (n_episodes) and the maximum length of each episode (max_t).
At the start of each episode the environment is reset and the agents receive their initial states.
While the number of timesteps is less than max_t, the following steps are performed.
The agent calls its act method with the current state as input. The method
passes the state through the actor network and returns an action for that state. An
environment step is then taken with this action, and it returns the next
state, the rewards and the dones (whether the episode has terminated or not). These are stored in the
env_info variable, from which each item is copied into its own
variable.

The agent then calls its step method, which first adds the experience tuple to
the shared replay buffer and, depending on the buffer's size, calls the learn method. The
rewards are added to the scores variable and the state is set to the next state so that
the episode can continue. If any component of the done variable indicates
that the episode is over, the max_t loop breaks and a new episode begins.
If the average score of the last 100 episodes is greater than 0.5, the networks'
weights are saved, the n_episodes loop breaks, and the maddpg function returns a
list with each episode's score. This list is plotted against the episode number, showing the
agent's learning over the course of the run.
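
The control flow of this training loop can be sketched as follows. To keep the example runnable without Unity, `ToyTennisEnv` and `RandomAgent` are hypothetical stand-ins with random rewards and actions. Their interfaces, the action range of [-1, 1] and the `target` and `window` arguments are assumptions, and the real loop reads the step results from env_info instead of a returned tuple.

```python
import random
from collections import deque


class ToyTennisEnv:
    """Hypothetical stand-in for the Unity environment (two agents, random rewards)."""

    def __init__(self, seed=0, episode_length=20):
        self.rng = random.Random(seed)
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [[0.0] * 8, [0.0] * 8]  # one local observation per agent

    def step(self, actions):
        self.t += 1
        next_states = [[self.rng.random() for _ in range(8)] for _ in range(2)]
        rewards = [self.rng.choice([0.0, 0.1]) for _ in range(2)]
        dones = [self.t >= self.episode_length] * 2
        return next_states, rewards, dones


class RandomAgent:
    """Hypothetical agent exposing the act/step interface described above."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.experiences = 0

    def act(self, states):
        # two continuous actions per agent, assumed to lie in [-1, 1]
        return [[self.rng.uniform(-1, 1) for _ in range(2)] for _ in states]

    def step(self, states, actions, rewards, next_states, dones):
        self.experiences += 1  # a real agent would store the tuple and possibly learn


def maddpg(env, agent, n_episodes=6000, max_t=300, target=0.5, window=100):
    scores, scores_window = [], deque(maxlen=window)
    for _ in range(n_episodes):
        states = env.reset()
        episode_rewards = [0.0, 0.0]
        for _ in range(max_t):
            actions = agent.act(states)
            next_states, rewards, dones = env.step(actions)
            agent.step(states, actions, rewards, next_states, dones)
            episode_rewards = [total + r for total, r in zip(episode_rewards, rewards)]
            states = next_states
            if any(dones):
                break
        score = max(episode_rewards)  # maximum over both agents
        scores.append(score)
        scores_window.append(score)
        if len(scores_window) == window and sum(scores_window) / window >= target:
            break  # the project code also saves the network weights at this point
    return scores
```

Calling `maddpg(ToyTennisEnv(), RandomAgent(), n_episodes=150)` returns the list of per-episode scores, which can then be plotted against the episode number.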

### Hyperparameters
There were many hyperparameters involved in the experiment. The value of each of them is given below:

|Hyperparameter|Value|
|:-- | :--: |
|Replay buffer size|1e5|
|Batch size|128|
|GAMMA (discount factor)|0.99|
|TAU|2e-3|
|Actor learning rate|1.5e-3|
|Critic learning rate|1.5e-3|
|Number of episodes|6000|
|Max number of timesteps per episode|300|


### Results

The best performance was achieved with DDPG, which reached the target average score of +0.5 in 1799 episodes. Finding suitable hyperparameters took a long time and was the hardest part of the project.

<img src="images/rewards.png">


### Ideas for Future Work

- Other algorithms like TRPO, PPO, A3C, A2C that have been discussed in the course could potentially lead to better results as well.

- The Q-prop algorithm, which combines both off-policy and on-policy learning, could be a good one to try.

- General optimization techniques like cyclical learning rates and warm restarts could be useful as well.