# Udacity P2: Continuous Control Project Report

## 1. Project Overview

This project focuses on solving a continuous control problem using deep reinforcement learning. The goal is to train an agent in the Unity ML-Agents “Reacher” environment, where a robotic arm learns to reach a target position by applying continuous actions. The Deep Deterministic Policy Gradient (DDPG) algorithm was implemented for this task.

## 2. Problem Description

The agent controls a robotic arm through a continuous action space (motor commands) and must keep the arm's hand at a specified target position.

* Observation Space: State vector including positions and velocities (e.g., 33 dimensions).

* Action Space: Continuous values (e.g., 4-dimensional vector with values between -1 and 1).

* Reward: +0.1 is provided for each step that the agent's hand is in the goal location.

* Number of agents: 1 or 20, depending on whether training ran on the CPU (1 agent) or the GPU (20 agents).
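
The way these rewards turn into a "solved" verdict can be sketched as follows. This is a minimal illustration, not the project's code: the +30 threshold comes from this report, while the 100-episode averaging window, the averaging over agents, and the helper names `episode_score` and `first_solved_episode` are assumptions made for the example.

```python
from collections import deque

TARGET_SCORE = 30.0   # success threshold mentioned in this report
WINDOW = 100          # assumed averaging window (not stated in this report)


def episode_score(per_agent_returns):
    """Mean undiscounted return across all agents for one episode."""
    return sum(per_agent_returns) / len(per_agent_returns)


def first_solved_episode(scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the rolling mean over the last
    `window` episodes first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

With a single agent, `episode_score` simply returns that agent's return; with 20 agents it averages over all of them before the score enters the rolling window.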

## 3. Methods

### 3.1 Learning Algorithm

In this project, the **Deep Deterministic Policy Gradient (DDPG)** algorithm was implemented to train the agent.
DDPG is an off-policy actor-critic algorithm designed for continuous action spaces. It consists of:

* **Actor network:** Proposes continuous actions given a state.

* **Critic network:** Estimates the Q-value of state-action pairs to evaluate the actor’s actions.

* **Replay buffer:** Stores past experiences to stabilize training.

* **Soft target updates:** Gradual update of target networks.

* **Exploration noise:** Ornstein-Uhlenbeck process to encourage exploration.

**Note (adaptive noise decay):** To encourage exploration during the early stages of training and gradually shift toward exploitation, the noise scale σ is multiplied by 0.995 at each step (σ *= 0.995) until it reaches a minimum of 0.05.
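
The exploration noise and its decay could look roughly like the sketch below. The decay factor (0.995) and floor (0.05) are the values from this report; `theta=0.15` and the initial `sigma=0.2` are commonly used defaults assumed here for illustration, and the class name `OUNoise` is hypothetical.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process with a decaying noise scale (sketch)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2,
                 sigma_decay=0.995, min_sigma=0.05, seed=0):
        self.size = size
        self.mu = mu
        self.theta = theta              # assumed mean-reversion rate
        self.sigma = sigma              # assumed initial noise scale
        self.sigma_decay = sigma_decay  # from the report
        self.min_sigma = min_sigma      # from the report
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean (e.g. at episode start)."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

    def decay(self):
        """Shrink sigma by the decay factor, never going below min_sigma."""
        self.sigma = max(self.min_sigma, self.sigma * self.sigma_decay)
```

In a training loop, the sampled noise would typically be added to the actor's output and the sum clipped to [-1, 1] before being sent to the environment, with `decay()` called once per step.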


### 3.2 Network Architecture

* **Actor network:** 3 fully connected layers. The two hidden layers have 400 and 300 neurons with ReLU activations; the output layer uses a tanh activation to keep actions within the valid range of -1 to 1.

* **Critic network:** 3 fully connected layers; the first layer processes state inputs, followed by concatenation with actions and further layers with ReLU activations.
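
The forward passes described above can be sketched with plain NumPy to make the shapes explicit. This is only an illustration: the project itself presumably uses a deep learning framework, the critic's hidden sizes are not stated in this report (400 and 300 are reused here as an assumption), and the weight initialisation and function names are hypothetical.

```python
import numpy as np

STATE_SIZE, ACTION_SIZE = 33, 4  # sizes from the problem description
H1, H2 = 400, 300                # actor hidden sizes; reused for the critic (assumption)

rng = np.random.default_rng(0)


def linear(n_in, n_out):
    """Random weights and zero bias for one fully connected layer."""
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out)), np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


actor_params = [linear(STATE_SIZE, H1), linear(H1, H2), linear(H2, ACTION_SIZE)]
# The critic's second layer receives the first layer's output plus the action.
critic_params = [linear(STATE_SIZE, H1), linear(H1 + ACTION_SIZE, H2), linear(H2, 1)]


def actor_forward(states):
    """Map a batch of states to actions in [-1, 1]."""
    (w1, b1), (w2, b2), (w3, b3) = actor_params
    h = relu(states @ w1 + b1)
    h = relu(h @ w2 + b2)
    return np.tanh(h @ w3 + b3)


def critic_forward(states, actions):
    """Map a batch of (state, action) pairs to scalar Q-values."""
    (w1, b1), (w2, b2), (w3, b3) = critic_params
    h = relu(states @ w1 + b1)                # first layer sees the state only
    h = np.concatenate([h, actions], axis=1)  # actions join after layer one
    h = relu(h @ w2 + b2)
    return h @ w3 + b3
```

Feeding a batch of shape `(N, 33)` through `actor_forward` yields actions of shape `(N, 4)`, and `critic_forward` returns one Q-value per row, shape `(N, 1)`.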

### 3.3 Hyperparameters

Hyperparameters with the **same** values in all scenarios:

    BATCH_SIZE = 128        # Number of experiences sampled per training step to update the networks.
    GAMMA = 0.99            # Discount factor determining how much future rewards are taken into account.
    TAU = 1e-3              # Rate at which target networks softly track the learned networks.
    WEIGHT_DECAY = 0        # L2 regularization term to prevent overfitting, set to zero here.
    sigma_decay = 0.995     # Factor by which exploration noise decreases gradually over time.
    min_sigma = 0.05        # Minimum allowed value for noise level to ensure some exploration remains.
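
To show where `GAMMA` and `TAU` enter the updates, here is a small sketch of the standard DDPG critic target and soft target update, written on plain floats for clarity. The function names are hypothetical, and a real implementation would apply the same formulas to network tensors.

```python
GAMMA = 0.99  # discount factor from the table above
TAU = 1e-3    # soft-update rate from the table above


def td_target(reward, next_q, done, gamma=GAMMA):
    """Critic target: y = r + gamma * Q_target(s', a') * (1 - done).

    `next_q` is the target critic's value for the next state and the
    target actor's action; `done` is 1.0 at episode end, else 0.0.
    """
    return reward + gamma * next_q * (1.0 - done)


def soft_update(local_params, target_params, tau=TAU):
    """Blend parameters: target <- tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]
```

With `TAU = 1e-3`, each update moves the target parameters only 0.1% of the way toward the learned ones, which is what keeps the critic's targets slow-moving.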

Hyperparameters whose values **differ** between training scenarios:

    BUFFER_SIZE: Maximum number of experiences stored in the replay buffer.
    LR_ACTOR: Learning rate used to update the actor (policy) network.
    LR_CRITIC: Learning rate used to update the critic (value) network.
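
The role of `BUFFER_SIZE` (together with `BATCH_SIZE`) can be illustrated with a minimal replay buffer. This is a sketch under assumptions: the class and field names are hypothetical, and the sizes in the usage note are placeholders, since the per-scenario values are not listed in this report.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that samples uniformly at random (sketch)."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # Once full, the deque silently drops the oldest experience.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return `batch_size` distinct experiences chosen uniformly."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

For example, `ReplayBuffer(buffer_size=1000, batch_size=128)` (placeholder sizes) would start returning batches once `len(buffer) >= 128`. A larger buffer keeps older, more diverse experience at the cost of memory.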

## 4. Plot of Rewards According to Hyperparameters

### Trial 1: Training on CPU with one agent

Environment solved in 1083 episodes! Average Score: 31.08
  
 <img src="img/rewards_cpu1.jpg" style="float: left;"/>
 


### Trial 2: Training on CPU with one agent

Environment solved in 1652 episodes! Average Score: 31.05

### Trial 3: Training on GPU with 20 agents

Environment solved in 111 episodes! Average Score: 30.17
 
 <img src="img/rewards_gpu.jpg" style="float: left; width: 50%;"/>

## 5. Challenges and Solutions

* Balancing exploration noise to avoid destabilizing learning.

* Tuning network architectures and hyperparameters for better convergence.

* Adjusting replay buffer size for effective learning.

## 6. Results

* With 20 agents, the environment was solved in 111 episodes; the two single-agent runs needed 1083 and 1652 episodes.

* Average rewards increased steadily during training, surpassing the success threshold (+30).

* The training process was stable with no significant performance drops.

* In the tests, models trained with 20 agents generalized to the environment better than models trained with a single agent. The 20-agent model was trained on a GPU and the single-agent model on a CPU, on different computer systems. Apart from BUFFER_SIZE, the other hyperparameters were the same.

## 7. Ideas for Future Work

* Try advanced algorithms: Implement TD3 (Twin Delayed DDPG) or SAC (Soft Actor-Critic) to improve learning stability and sample efficiency.

* Parameter tuning: Experiment with different learning rates, noise parameters, and network sizes to optimize training speed and final performance.

* Multi-agent training: Extend the approach to multi-agent environments to test scalability.
