# Getting started

## Introduction to the game

**Observation Space**
The state is an 8-dimensional vector: 

* 1-2: the coordinates of the lander in x & y,
* 3-4: its linear velocities in x & y,
* 5: its angle,
* 6: its angular velocity,
* 7-8: two booleans that represent whether each leg is in contact with the ground or not.
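
As a rough illustration, the eight components can be given names so that later code does not rely on bare indices. The `LanderState` tuple, its field names, and the `unpack_state` helper below are hypothetical labels chosen for this sketch; they are not part of Gymnasium.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation."""
    x: float
    y: float
    x_vel: float
    y_vel: float
    angle: float
    angular_vel: float
    left_leg: bool
    right_leg: bool


def unpack_state(obs: Sequence[float]) -> LanderState:
    """Convert a raw 8-value observation into a LanderState."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, x_vel, y_vel, angle, angular_vel, left, right = (float(v) for v in obs)
    # The leg flags arrive as 0.0 / 1.0 floats, so threshold them into booleans.
    return LanderState(x, y, x_vel, y_vel, angle, angular_vel, left > 0.5, right > 0.5)


state = unpack_state([0.1, 1.4, -0.2, -0.5, 0.05, 0.0, 0.0, 1.0])
print(state.y, state.right_leg)
```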

**Action Space**
There are four discrete actions available:

```
0: do nothing
1: fire left orientation engine
2: fire main engine
3: fire right orientation engine
```
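
To keep the action ids readable in code, they can be wrapped in an enum. The `Action` class and its member names are an assumption made for this sketch; the environment itself only expects the integers 0 to 3.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical names for the four discrete action ids."""
    NOOP = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


# An IntEnum member behaves like its integer value, so it can be passed
# wherever a plain action id is expected.
print(Action(2).name)
print(int(Action.FIRE_RIGHT))
```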

**Rewards**
After every step a reward is granted. The reward for each step:

* is increased/decreased the closer/further the lander is to the landing pad.
* is increased/decreased the slower/faster the lander is moving.
* is decreased the more the lander is tilted (angle not horizontal).
* is increased by 10 points for each leg that is in contact with the ground.
* is decreased by 0.03 points each frame a side engine is firing.
* is decreased by 0.3 points each frame the main engine is firing.

The episode receives an additional reward of -100 or +100 points for crashing or landing safely, respectively.

An episode is considered a solution if it scores at least 200 points.
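
The engine costs and the 200-point threshold can be expressed directly in code. This is only a sketch of the fuel-related terms listed above; the constant and function names are hypothetical, and the distance, speed, tilt, and leg-contact terms are computed inside the environment and are not reproduced here.

```python
SIDE_ENGINE_COST = 0.03  # points lost per frame a side engine fires
MAIN_ENGINE_COST = 0.3   # points lost per frame the main engine fires
SOLVED_SCORE = 200       # minimum episode score that counts as a solution


def fuel_penalty(actions):
    """Total engine penalty for a sequence of action ids (0-3)."""
    penalty = 0.0
    for action in actions:
        if action == 2:          # main engine
            penalty += MAIN_ENGINE_COST
        elif action in (1, 3):   # left or right orientation engine
            penalty += SIDE_ENGINE_COST
    return penalty


def is_solved(episode_score):
    """True if the episode score reaches the solution threshold."""
    return episode_score >= SOLVED_SCORE


print(fuel_penalty([0, 1, 2, 3, 2]))
print(is_solved(215.3))
```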

## Create [gym env](https://gymnasium.farama.org/introduction/basic_usage/)

1. Create conda env from environment.yml

```
    mamba env create -f environment.yml
```

2. Add `gymnasium` to the environment

```
    mamba install -c conda-forge gymnasium==1.0.0
    mamba install conda-forge::pygame
```

3. Import gymnasium into project

4. Create gym env 'LunarLander-v3'

> This environment is part of the Box2D environments which contains general information about the environment.
> Action Space: Discrete(4)
> Observation Space:
> Box([ -2.5 -2.5 -10. -10. -6.2831855 -10. -0. -0. ], [ 2.5 2.5 10. 10. 6.2831855 10. 1. 1. ], (8,), float32)



In [None]:
import gymnasium as gym

env = gym.make("LunarLander-v3", render_mode="human")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# print(state_size, action_size)
# 8 4

## Add deep learning dependencies

1. Install TensorFlow & Keras

```
    mamba install conda-forge::tensorflow
    mamba install conda-forge::keras
```

2. Import the TensorFlow packages

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

## Add matplotlib

```
    mamba install conda-forge::matplotlib
```

## Implement DQN Algorithm

1. Define the experience replay buffer
2. Create both the current and target networks
3. Play and get feedback from the environment
4. Collect the feedback and cache it in the buffer
5. Train the current network by sampling data from the buffer
6. Periodically update the target network
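
Two building blocks from the steps above can be sketched without any deep learning library: a bounded replay buffer and the target value used when training the current network. The `ReplayBuffer` class and `td_target` function are hypothetical helpers written for this illustration; the agent later in this notebook implements the same ideas inline.

```python
import random
from collections import deque

GAMMA = 0.99  # discount factor, matching the hyperparameters used in this notebook


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.items.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.items), batch_size)

    def __len__(self):
        return len(self.items)


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Target for the taken action: r if the episode ended, else r + gamma * max Q_target(s')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)
print(len(buffer))
print(td_target(1.0, [0.5, 2.0], done=False))
```

The `next_q_values` argument is meant to come from the target network, which is what keeps the training targets stable between target updates.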

In [None]:
# Hyperparameters
GAMMA = 0.99
LEARNING_RATE = 0.0005
MEMORY_SIZE = 100000
BATCH_SIZE = 64
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
TARGET_UPDATE_FREQ = 1000
EPISODES = 20


def preprocess_state(state):
    return np.reshape(state, [1, state_size])
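
Since the agent's `replay()` method multiplies the exploration rate by `EXPLORATION_DECAY` on every call, it is useful to estimate how quickly the rate reaches `EXPLORATION_MIN`. The helper name `calls_until_min` is hypothetical; the arithmetic only assumes the multiplicative decay described by the hyperparameters above.

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


def calls_until_min(start, minimum, decay):
    """Smallest n such that start * decay**n <= minimum."""
    return math.ceil(math.log(minimum / start) / math.log(decay))


n = calls_until_min(EXPLORATION_MAX, EXPLORATION_MIN, EXPLORATION_DECAY)
print(n)
```

With these values the floor is reached after roughly 900 replay calls, which can be fewer steps than a handful of early episodes; a slower decay may be worth trying if the agent stops exploring too early.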

The `DQNAgent` class contains both network implementations and exposes an interface for updating the target network.

Topology of Network

```
+-------------------+       +-------------------+       +-------------------+       +-------------------+
|    Input Layer    |       |   Hidden Layer 1  |       |   Hidden Layer 2  |       |    Output Layer   |
|                   |       |                   |       |                   |       |                   |
|   8-dimensional   |------>|   64-dimensional  |------>|   64-dimensional  |------>|   4-dimensional   |
|   (State Space)   |  W₁   |      (ReLU)       |  W₂   |      (ReLU)       |  W₃   |     (Linear)      |
+-------------------+       +-------------------+       +-------------------+       +-------------------+
      ↑
      | (8 values)
      | [x,y,x_vel,y_vel,angle,angular_vel,left_leg,right_leg]
      |
Environment
```

Full Parameter Count:

```
┌──────────────┬───────────────────┐
│ Layer        │ Parameters        │
├──────────────┼───────────────────┤
│ Input→Hidden1│ 576               │
│ Hidden1→2    │ 4,160             │
│ Hidden2→Out  │ 260               │
├──────────────┼───────────────────┤
│ TOTAL        │ 4,996             │
└──────────────┴───────────────────┘
```
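
The counts in the table follow from the usual formula for a fully connected layer: one weight per input-output pair plus one bias per output. The short check below reproduces them; `dense_params` is a hypothetical helper name.

```python
def dense_params(n_in, n_out):
    """Parameters of a dense layer: n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out


layer_sizes = [8, 64, 64, 4]  # input, hidden 1, hidden 2, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```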

Activation Flow:

```
State → ReLU(ReLU(State·W₁ + b₁)·W₂ + b₂)·W₃ + b₃ → Q-values
```
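
The activation flow can be traced by hand on a toy network. The sketch below uses plain Python lists and made-up 2-unit layers; the weights are illustrative values, not taken from the trained model, and the helper names are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Compute x·W + b, where weights has shape (len(x), len(bias))."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def q_values(state, layers):
    """Apply ReLU after every layer except the last, which stays linear."""
    h = state
    for i, (weights, bias) in enumerate(layers):
        h = dense(h, weights, bias)
        if i < len(layers) - 1:
            h = relu(h)
    return h


toy_layers = [
    ([[1, -1], [0, 1]], [0, 0]),    # W1, b1
    ([[1, 0], [0, -1]], [0, 0.5]),  # W2, b2
    ([[2, 1], [3, 4]], [0.5, 0]),   # W3, b3
]
print(q_values([1, 2], toy_layers))
```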

Output Interpretation:

```
[Do nothing, Fire left, Fire main, Fire right]
```
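
Each output is the estimated Q-value of the action at the same index, so the greedy choice is simply the index of the largest value. A minimal sketch of that selection, combined with epsilon-greedy exploration, is shown below; the function names and the example Q-values are assumptions made for illustration.

```python
import random

ACTION_LABELS = ["Do nothing", "Fire left", "Fire main", "Fire right"]


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


example_q = [0.1, -0.3, 1.2, 0.4]  # illustrative values only
print(ACTION_LABELS[greedy_action(example_q)])
```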

In [None]:
from collections import deque

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.expirence_replay_buff = deque(maxlen=MEMORY_SIZE)
        self.exploration_rate = EXPLORATION_MAX
        
        # Main(Current) network
        self.model = self._build_model()
        
        # Target network
        self.target_model = self._build_model()
        self.target_model.set_weights(self.model.get_weights())
    
    def _build_model(self):
        # Two hidden ReLU layers of 64 units and a linear output with one Q-value per action
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(self.state_size,)),
            Dense(64, activation='relu'),
            Dense(64, activation='relu'),
            Dense(self.action_size, activation='linear')
        ])
        model.compile(loss='mse', optimizer=Adam(learning_rate=LEARNING_RATE))
        return model
    
    def remember(self, state, action, reward, next_state, done):
        self.expirence_replay_buff.append((state, action, reward, next_state, done))
    
    def act(self, state):
        if np.random.rand() < self.exploration_rate:
            return np.random.randint(self.action_size)
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])
    
    def replay(self):
        if len(self.expirence_replay_buff) < BATCH_SIZE:
            return
        
        minibatch = np.random.choice(len(self.expirence_replay_buff), BATCH_SIZE, replace=False)
        states = np.zeros((BATCH_SIZE, self.state_size))
        next_states = np.zeros((BATCH_SIZE, self.state_size))
        actions, rewards, dones = [], [], []
        
        for i, idx in enumerate(minibatch):
            state, action, reward, next_state, done = self.expirence_replay_buff[idx]
            states[i] = state
            next_states[i] = next_state
            actions.append(action)
            rewards.append(reward)
            dones.append(done)
        
        # Current Q values
        current_q = self.model.predict(states, verbose=0)
        
        # Target Q values
        target_q = self.target_model.predict(next_states, verbose=0)
        max_target_q = np.amax(target_q, axis=1)
        
        for i in range(BATCH_SIZE):
            if dones[i]:
                current_q[i][actions[i]] = rewards[i]
            else:
                current_q[i][actions[i]] = rewards[i] + GAMMA * max_target_q[i]
        
        # Train the model
        self.model.fit(states, current_q, verbose=0)
        
        # Decay exploration rate
        self.exploration_rate = max(EXPLORATION_MIN, 
                                  self.exploration_rate * EXPLORATION_DECAY)
    
    def update_target(self):
        self.target_model.set_weights(self.model.get_weights())


In [None]:
import matplotlib.pyplot as plt

total_rewards = np.empty(EPISODES)  # List of rewards per episode
exploration_rates = np.empty(EPISODES)  # List of exploration rates per episode


In [None]:

agent = DQNAgent(state_size, action_size)

# Training loop
total_steps = 0  # counted across all episodes so the target network updates on schedule

for e in range(EPISODES):
    init_state, _ = env.reset()
    state = preprocess_state(init_state)
    total_reward = 0

    while True:
        total_steps += 1

        # With render_mode="human" the window is refreshed inside env.step(),
        # so no explicit env.render() call is needed.
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        next_state = preprocess_state(next_state)
        agent.remember(state, action, reward, next_state, done)

        state = next_state
        total_reward += reward

        # Train the current network on a sampled minibatch
        agent.replay()

        # Copy the current weights into the target network every TARGET_UPDATE_FREQ steps
        if total_steps % TARGET_UPDATE_FREQ == 0:
            agent.update_target()

        if done:
            total_rewards[e] = total_reward
            exploration_rates[e] = agent.exploration_rate
            print(f"Episode: {e+1}/{EPISODES}, Score: {total_reward:.2f}, Exploration: {agent.exploration_rate:.2f}")
            break

env.close()


## Show the learning performance

In [None]:

episodes = np.arange(1, EPISODES+1)

# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

# Plot Total Reward
ax1.plot(episodes, total_rewards, 'b-', linewidth=1)
ax1.set_title('Training Performance')
ax1.set_ylabel('Total Reward', color='b')
ax1.tick_params('y', colors='b')
ax1.grid(True, alpha=0.3)

# Plot Exploration Rate
ax2.plot(episodes, exploration_rates, 'r-', linewidth=1)
ax2.set_xlabel('Episodes')
ax2.set_ylabel('Exploration Rate', color='r')
ax2.tick_params('y', colors='r')
ax2.grid(True, alpha=0.3)

# Add horizontal line at solved threshold (LunarLander: 200)
ax1.axhline(y=200, color='g', linestyle='--', label='Solved Threshold')
ax1.legend()

plt.tight_layout()
plt.show()