<a href="https://colab.research.google.com/github/aggarwal-ujjwal/AI-ML/blob/main/Deep_Q_Learning_for_Lunar_Landing_Handwritten_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://gymnasium.farama.org/environments/box2d/lunar_lander/

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  swig4.0
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 35 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 3s (414 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 126435 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.2-1ubuntu1_amd64.deb ...
Unpacking swig4.0 (4.0.2-1ubuntu1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.2-1ubunt

In [None]:
import os
import random
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F
from collections import deque, namedtuple



Creating the NN

In [None]:
class NeuralNetwork(nn.Module):  # inherits from nn.Module
  def __init__(self, input_size, action_size, seed=42):  # seed is optional; it only makes the weight initialization reproducible
    super(NeuralNetwork, self).__init__()
    self.seed = torch.manual_seed(seed)
    self.fc1 = nn.Linear(input_size, 64)  # 64 hidden units, a size found by experimentation
    self.fc2 = nn.Linear(64, 64)
    self.fc3 = nn.Linear(64, action_size)  # one output per action (4 for Lunar Lander)

  # Step-by-step version:
  # def forward(self, state):
  #   x = self.fc1(state)
  #   x = F.relu(x)
  #   x = self.fc2(x)
  #   x = F.relu(x)
  #   return self.fc3(x)

  # Compact version:
  def forward(self, state):  # state: vector of 8 values, e.g. [0.5, -0.2, 1.0, 0.8, ...]
    # fc1(state): [8] -> [64] (linear transformation + bias)
    # F.relu(): applies ReLU activation, max(0, x)
    # x shape: [64] with all negative values clipped to 0
    x = F.relu(self.fc1(state))
    x = F.relu(self.fc2(x))
    actions = self.fc3(x)  # [64] -> [action_size], no activation on the output
    return actions


In [None]:
#Example for understanding - not to be used anywhere
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
fc = nn.Linear(10, 64)
output = fc(x)
print(output)
print(x.shape, output.shape)
#Transforms input features into a higher-dimensional representation (10 → 64), allowing the network to learn more complex patterns.

tensor([-2.9912e-01, -3.4230e+00,  3.5476e+00,  2.2314e+00, -5.6016e+00,
         2.5683e+00,  1.2996e+00, -2.8921e+00, -2.7405e+00, -2.0260e+00,
         6.8871e+00, -3.4501e+00, -1.0745e+01, -2.0268e+00,  1.1376e+00,
        -4.9283e-01,  4.7010e-01,  2.5421e+00, -1.6787e+00, -4.8147e+00,
        -9.8752e-01, -9.9501e+00, -9.3120e+00,  1.4174e+00,  4.7613e+00,
         1.2533e+00,  3.4996e+00, -2.8903e+00,  1.8413e+00,  3.6027e-01,
        -1.9338e+00, -3.8423e+00,  1.9923e+00, -2.8707e+00,  1.1167e+00,
         3.4304e+00,  2.0469e+00,  1.6896e+00, -1.6622e+00, -4.7463e-01,
         3.0907e+00,  1.1544e+00,  2.5109e+00, -1.3669e+00, -2.1406e-01,
        -3.3151e+00, -2.3495e+00, -2.3412e+00,  3.0829e+00,  5.7850e+00,
         4.1797e+00,  5.5170e+00, -5.9128e+00,  4.0458e+00, -4.6105e+00,
        -2.9455e+00,  3.3453e+00,  1.0292e-02, -6.8615e+00,  2.3503e+00,
        -4.5343e+00,  1.1474e+00,  2.4058e+00, -1.1361e+00],
       grad_fn=<ViewBackward0>)
torch.Size([10]) torch.Size([64])

Training the NN


In [None]:
import gymnasium as gym
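env = gym.make('LunarLander-v3')  # environment ID taken from the Gymnasium documentation
print(env.action_space)  # 4 discrete actions: do nothing, fire left engine, fire main engine, fire right engine
print(env.observation_space)  # input: 8 values describing the lander's state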

input_size = env.observation_space.shape[0]
action_size = env.action_space.n

print(input_size, action_size)



Discrete(4)
Box([ -2.5        -2.5       -10.        -10.         -6.2831855 -10.
  -0.         -0.       ], [ 2.5        2.5       10.        10.         6.2831855 10.
  1.         1.       ], (8,), float32)
8 4


Initializing the hyperparameters

In [None]:
learning_rate = 5e-4  # scientific notation for 5/10000 = 0.0005
minibatch_size = 100
discount_factor = 0.99
replay_buffer_size = int(1e5)
interpolation_parameter = 1e-3  # also known as tau

learning_rate = 5e-4

What: Controls how large a step the optimizer takes when updating the network's weights

Purpose: Too high = unstable learning, too low = slow learning

5e-4: A good middle ground for most deep learning tasks
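
To see why the step size matters, here is a minimal sketch that runs plain gradient descent on the toy function f(w) = w**2. This is not the notebook's agent: the helper name `descend` and the three step sizes are illustrative assumptions, chosen only to show the "too low / reasonable / too high" behaviour.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # one update step
    return w

too_low = descend(0.001)   # barely moves away from the starting point 1.0
moderate = descend(0.1)    # shrinks steadily towards the minimum at 0
too_high = descend(1.1)    # overshoots more on every step and diverges

print(too_low, moderate, too_high)
```

The same trade-off applies to the network's optimizer, although the right value there has to be found by experiment.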

minibatch_size = 100

What: Number of experiences processed together in one training step

Purpose: Balances training stability vs computational efficiency

100: Good compromise between noise reduction and memory usage
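
The noise-reduction side of that trade-off can be sketched with the standard library alone. The Gaussian samples below merely stand in for noisy per-experience learning signals, and `spread_of_batch_means` is a hypothetical helper, not something the notebook defines.

```python
import random
import statistics

def spread_of_batch_means(batch_size, trials=500, seed=0):
    """Standard deviation of the average of `batch_size` noisy samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

small = spread_of_batch_means(10)    # noisier average
large = spread_of_batch_means(100)   # smoother average, but 10x the work per step

print(small, large)
```

Averaging over more samples gives a steadier estimate, at the cost of more computation and memory per training step.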

discount_factor = 0.99

What: How much future rewards matter compared to immediate rewards

Purpose: 0 = only the immediate reward counts, 1 = future rewards count as much as immediate ones

0.99: Values future rewards highly but slightly less than immediate ones
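
As a small worked example, the sketch below computes the discounted sum of three rewards of 1.0 for different discount factors. The helper `discounted_return` and the reward values are made up for illustration; they are not taken from the Lunar Lander environment.

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards, where the reward at step t is weighted by discount_factor**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total

rewards = [1.0, 1.0, 1.0]

only_now = discounted_return(rewards, 0.0)    # only the first reward counts
typical = discounted_return(rewards, 0.99)    # 1 + 0.99 + 0.99**2
all_equal = discounted_return(rewards, 1.0)   # plain sum of all rewards

print(only_now, typical, all_equal)
```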

replay_buffer_size = int(1e5)

What: Maximum number of past experiences stored in memory

Purpose: Allows agent to learn from diverse past experiences, not just recent ones

100,000: Large enough for good diversity, small enough to fit in memory
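
One way to picture a fixed-size buffer is `collections.deque` with `maxlen`, which silently drops the oldest item once it is full. This is only a sketch with a tiny capacity and placeholder experiences; the notebook's own buffer is a plain list.

```python
from collections import deque

capacity = 3                      # tiny capacity so the eviction is easy to see
buffer = deque(maxlen=capacity)

for step in range(5):
    buffer.append(("state", step))   # placeholder experience

# Only the 3 most recent experiences remain; steps 0 and 1 were evicted.
print(list(buffer))
```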

interpolation_parameter = 1e-3 (tau)

What: How quickly the target network updates toward the main network

Purpose: Stabilizes training by slowly updating the target used for loss calculation

0.001: Very slow updates = more stable training (common in DQN/DDPG)
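
Assuming the usual soft-update rule, target = tau * local + (1 - tau) * target, the effect of a small tau can be sketched with plain Python floats standing in for network weights. The function `soft_update` and the weight values are illustrative assumptions only.

```python
def soft_update(local_weights, target_weights, tau):
    """Move each target weight a fraction tau of the way towards the local weight."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_weights, target_weights)]

local = [1.0, -2.0]     # stand-in for the main network's weights
target = [0.0, 0.0]     # stand-in for the target network's weights

one_step = soft_update(local, target, tau=1e-3)   # a single update barely moves the target

for _ in range(1000):                             # many updates move it most of the way
    target = soft_update(local, target, tau=1e-3)

print(one_step, target)
```

With tau = 1e-3 a single update changes the target by only 0.1% of the gap, which is why the target used in the loss stays nearly fixed from one step to the next.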

These are typical Deep Q-Network (DQN) or Deep Deterministic Policy Gradient (DDPG) hyperparameters for reinforcement learning.



**ReplayMemory** implements an experience replay buffer that stores and samples past experiences to train a reinforcement learning agent more effectively.

In [1]:
class ReplayMemory():
  def __init__(self, capacity):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.capacity = capacity
    self.memory = []

  def push(self, event):
    self.memory.append(event)
    if len(self.memory) > self.capacity:  # capacity is an int, so compare against it directly
      del self.memory[0]  # drop the oldest experience

  def sample(self, batch_size):
    experiences = random.sample(self.memory, k=batch_size)
    states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
    actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)  # integer action indices (int64), e.g. for indexing Q-values with gather
    rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
    next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
    dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
    return (states, actions, rewards, next_states, dones)


[e[0] for e in experiences] - extracts state (index 0) from each experience

np.vstack() - stacks arrays vertically into a batch

torch.from_numpy() - converts NumPy array to PyTorch tensor

.float() - ensures float32 data type

.to(self.device) - moves tensor to GPU/CPU


Each tuple: (state, action, reward, next_state, done)

E.g.
self.memory = [
    (np.array([0.1, 0.2, 0.05, 0.1]), 0, 1.0, np.array([0.15, 0.25, 0.06, 0.12]), False),
    (np.array([0.15, 0.25, 0.06, 0.12]), 1, 1.0, np.array([0.12, 0.22, 0.04, 0.08]), False),
    (np.array([0.12, 0.22, 0.04, 0.08]), 0, 1.0, np.array([0.18, 0.28, 0.07, 0.14]), False),
    # ... more experiences
]

batch_size = 2
experiences = random.sample(self.memory, k=2)
# Result: [(experience_1), (experience_3)]

[
    np.array([0.1, 0.2, 0.05, 0.1]),    # from experience_1
    np.array([0.12, 0.22, 0.04, 0.08])  # from experience_3
]

np.vstack([...])

np.array([
    [0.1,  0.2,  0.05, 0.1 ],   # batch item 0
    [0.12, 0.22, 0.04, 0.08]    # batch item 1
])

states = tensor([
    [0.1000, 0.2000, 0.0500, 0.1000],
    [0.1200, 0.2200, 0.0400, 0.0800]
], device='cuda:0')  # Shape: torch.Size([2, 4])

Summary

Input: List of experience tuples in memory

Process: Random sample → Extract states → Stack → Convert to tensor

Output: Batched tensor ready for neural network training

Learning: keyword-only parameters

In [None]:
# Signature of random.sample, taken from the source of random.py:
# def sample(self, population, k, *, counts=None):
#                                 ^
#                    Everything after * must be keyword-only

# Why use keyword-only parameters?
# Clarity: Forces explicit naming of optional parameters
# API stability: Can add new parameters without breaking existing calls
# Prevents errors: Avoids accidental positional argument mistakes
def greet(name, *, greeting="Hello"):
    return f"{greeting}, {name}!"

# Valid calls:
greet("Alice")                    # greeting uses default
greet("Alice", greeting="Hi")     # keyword argument

# Invalid call (raises TypeError if uncommented):
# greet("Alice", "Hi")            # greeting must be passed by keyword

def process_data(data, *, sort=False, reverse=False, limit=None):
    # Implementation here
    pass

# Valid calls:
process_data([1, 2, 3])
process_data([1, 2, 3], sort=True)
process_data([1, 2, 3], sort=True, reverse=True, limit=10)

# Invalid calls (each raises TypeError if uncommented):
# process_data([1, 2, 3], True)
# process_data([1, 2, 3], True, False)
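
To see the error without stopping the cell, the invalid call can be wrapped in `try`/`except`. This sketch redefines `greet` so that it runs on its own.

```python
def greet(name, *, greeting="Hello"):
    return f"{greeting}, {name}!"

try:
    greet("Alice", "Hi")          # second positional argument is not allowed
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

result = greet("Alice", greeting="Hi")   # passing it by keyword works
print(result)
```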
