# Udacity P3: Collaboration and Competition Project Report

## 1. Project Overview

This project implements a **Multi-Agent Deep Deterministic Policy Gradient (MADDPG)** approach to solve the **Tennis** environment from Unity ML-Agents.
The goal is for two agents to keep a ball in play by hitting it back and forth, maximizing their average score over time.

## 2. Environment Description


- **State size:** 24 (per agent)

- **Action size:** 2 (racket movement: horizontal and vertical)

- **Number of agents:** 2

- **Reward function:**

    - **+0.1** for hitting the ball over the net

    - **-0.01** if the agent lets the ball hit the ground or hits it out of bounds

## 3. Methods

### 3.1 Learning Algorithm

In this project, the agents are trained with the **Deep Deterministic Policy Gradient (DDPG)** algorithm, which MADDPG extends to the multi-agent setting.
DDPG is an off-policy actor-critic algorithm designed for continuous action spaces. It consists of:

* **Actor network:** Proposes continuous actions given a state.

* **Critic network:** Estimates the Q-value of state-action pairs to evaluate the actor’s actions.

* **Replay buffer:** Stores past experiences to stabilize training.

* **Soft target updates:** Gradual update of target networks.

* **Exploration noise:** Ornstein-Uhlenbeck process to encourage exploration.
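
The replay buffer from the list above can be sketched as a fixed-size queue with uniform sampling. This is a minimal illustration using only the standard library; the `Experience` record and the `ReplayBuffer` class are hypothetical names, not taken from the project code.

```python
import random
from collections import deque, namedtuple

# Hypothetical record type; the field names are illustrative only.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size FIFO buffer that returns uniformly sampled minibatches."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Usage with dummy transitions shaped like the Tennis environment (24 / 2).
buffer = ReplayBuffer(buffer_size=1000, batch_size=4)
for t in range(10):
    buffer.add([0.0] * 24, [0.0, 0.0], 0.1 * (t % 2), [0.0] * 24, False)
batch = buffer.sample()
```

In training, sampling would typically start only once the buffer holds at least one full batch.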

**Note (adaptive noise decay):** To encourage exploration in the early stages of training and gradually shift toward exploitation, the noise scale σ is multiplied by 0.995 at each step (σ *= 0.995), down to a minimum of 0.05.
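
A minimal sketch of Ornstein-Uhlenbeck noise with a decaying scale is shown below. The decay factor (0.995) and floor (0.05) come from this report; `theta=0.15` and the initial `sigma=0.2` are commonly used DDPG defaults and are assumptions here, as is the `OUNoise` class name.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process whose noise scale (sigma) decays over time."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2,
                 sigma_decay=0.995, min_sigma=0.05, seed=0):
        self.mu = [mu] * size
        self.theta = theta            # assumed mean-reversion rate
        self.sigma = sigma            # assumed initial noise scale
        self.sigma_decay = sigma_decay
        self.min_sigma = min_sigma
        self.rng = random.Random(seed)
        self.state = list(self.mu)

    def reset(self):
        """Reset the internal state to the mean (e.g. at episode start)."""
        self.state = list(self.mu)

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1):
        # mean-reverting, temporally correlated noise.
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        # Decay sigma after every sample, but never below the floor.
        self.sigma = max(self.min_sigma, self.sigma * self.sigma_decay)
        return list(self.state)


noise = OUNoise(size=2)          # one noise value per action dimension
perturbation = noise.sample()    # added to the actor's output before clipping
```

With these assumed starting values, σ reaches its floor after a few hundred steps, so most of training runs with the minimum noise level.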


### 3.2 Network Architecture

* **Actor network:** 3 fully connected layers: two hidden layers with 400 and 300 units and ReLU activations, and an output layer with tanh activation that keeps actions within the valid range.

* **Critic network:** 3 fully connected layers; the first layer processes the state input, its output is concatenated with the action, and the result passes through the remaining layers, with ReLU activations.
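
To make the layer shapes concrete, the sketch below lists the input and output width of each layer and counts the trainable parameters. The critic's hidden widths are not given in this report, so reusing 400 and 300 is an assumption, as is a critic that sees only one agent's state and action (a centralized MADDPG critic would take the observations and actions of both agents instead).

```python
STATE_SIZE = 24    # per-agent observation size
ACTION_SIZE = 2    # per-agent action size
FC1_UNITS, FC2_UNITS = 400, 300  # reported for the actor; assumed for the critic


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Actor: state -> 400 (ReLU) -> 300 (ReLU) -> action (tanh)
actor_layers = [
    (STATE_SIZE, FC1_UNITS),
    (FC1_UNITS, FC2_UNITS),
    (FC2_UNITS, ACTION_SIZE),
]

# Critic: state -> 400 (ReLU), concatenate the action, -> 300 (ReLU) -> Q-value
critic_layers = [
    (STATE_SIZE, FC1_UNITS),
    (FC1_UNITS + ACTION_SIZE, FC2_UNITS),
    (FC2_UNITS, 1),
]

actor_params = sum(dense_params(i, o) for i, o in actor_layers)
critic_params = sum(dense_params(i, o) for i, o in critic_layers)
print(actor_params, critic_params)
```

The only structural difference between the two is the critic's second layer, which is 2 inputs wider because the action is concatenated there.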

### 3.3 Hyperparameters

    BATCH_SIZE = 128        # Number of experiences sampled per training step to update the networks.
    GAMMA = 0.99            # Discount factor determining how much future rewards are taken into account.
    TAU = 1e-3              # Rate at which target networks softly track the learned networks.
    BUFFER_SIZE = int(2e5)  # Maximum number of experiences stored in the replay buffer.
    LR_ACTOR = 1e-4         # Learning rate of the actor network.
    LR_CRITIC = 1e-3        # Learning rate of the critic network.
    WEIGHT_DECAY = 0        # L2 regularization term to prevent overfitting, set to zero here.
    sigma_decay = 0.995     # Factor by which exploration noise decreases gradually over time.
    min_sigma = 0.05        # Minimum allowed value for noise level to ensure some exploration remains.
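
As a rough illustration of where `GAMMA` and `TAU` enter the update, the sketch below computes the critic's TD target and a soft target-network update on plain Python floats. The function names `td_target` and `soft_update` are hypothetical, and real network parameters would be tensors rather than lists.

```python
GAMMA = 0.99  # discount factor, as listed above
TAU = 1e-3    # soft-update rate, as listed above


def td_target(reward, next_q, done, gamma=GAMMA):
    """Critic target y = r + gamma * Q_target(s', a'); no bootstrap if done."""
    return reward + gamma * next_q * (1.0 - float(done))


def soft_update(local_params, target_params, tau=TAU):
    """Blend each target parameter toward its local counterpart:
    target <- tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


y = td_target(reward=0.1, next_q=1.0, done=False)
new_target = soft_update(local_params=[1.0], target_params=[0.0])
```

With `TAU = 1e-3`, each update moves the target network only 0.1% of the way toward the learned network, which keeps the TD targets slowly changing.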

## 4. Plot of Rewards According to Hyperparameters

    Environment solved in 2948 episodes!    Average Score: 0.50
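
The report does not restate the solve criterion. The sketch below assumes the usual one for this environment: the episode score is the maximum over the two agents, and the environment counts as solved when the average of that score over 100 consecutive episodes reaches +0.5. The function name `first_solved_episode` is hypothetical.

```python
from collections import deque


def first_solved_episode(episode_scores, target=0.5, window=100):
    """Return the 1-based episode at which the moving average first
    reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(max(agent_scores))  # episode score = best of the two agents
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores: 100 scoreless episodes, then 100 episodes scoring 1.0.
scores = [(0.0, 0.0)] * 100 + [(0.6, 1.0)] * 100
solved_at = first_solved_episode(scores)
```

Because the criterion is a moving average, the reported episode count lags behind the point at which individual episodes first score well.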

The average reward continued to increase in the episodes that followed. As the agents learned, each episode lasted longer, so training was stopped manually after 4000 episodes.
  
 <img src="img/img.png" style="float: left;"/>
 


## 5. Results

* Initially, the agents acted almost randomly due to lack of experience and sparse rewards.

* Over time, they began to coordinate better, keeping the ball in play longer.

* The MADDPG approach proved effective for continuous control in a cooperative–competitive multi-agent setting.

## 6. Ideas for Future Work

* Experiment with reward shaping (e.g., bonus for faster ball speed) to speed up learning.

* Hyperparameter tuning (learning rates, τ, γ) for faster convergence.

* Try a double-critic architecture (TD3-style) for more stable learning.
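
For the double-critic idea, a minimal sketch of the TD3-style target is given below. It is not part of the current implementation; the function name and the default `gamma` are illustrative assumptions.

```python
def clipped_double_q_target(reward, next_q1, next_q2, done, gamma=0.99):
    """TD3-style target: bootstrap from the smaller of the two
    target-critic estimates to reduce Q-value overestimation."""
    return reward + gamma * min(next_q1, next_q2) * (1.0 - float(done))


target = clipped_double_q_target(reward=0.1, next_q1=2.0, next_q2=1.0, done=False)
```

Both critics would be trained toward this shared target, while the actor is updated using only one of them.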