<a href="https://colab.research.google.com/github/anngeorge12/LunarLanding/blob/main/Deep_Q_Learning_for_Lunar_Landing_Ann.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q-Learning for Lunar Landing

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


### Importing the libraries

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from torch.autograd import Variable
from collections import deque, namedtuple

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [None]:
class Network(nn.Module):
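  def __init__(self, state_size, action_size, seed = 42):
    super(Network, self).__init__()
    self.seed = torch.manual_seed(seed)    # fix the seed so the initial weights are reproducible
    self.fc1 = nn.Linear(state_size, 64)   # input layer -> first hidden layer
    self.fc2 = nn.Linear(64, 64)           # first hidden layer -> second hidden layer
    self.fc3 = nn.Linear(64, action_size)  # second hidden layer -> one Q-value per action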

  def forward(self, state):
    x = self.fc1(state)
    x = F.relu(x)
    x = self.fc2(x)
    x = F.relu(x)
    return self.fc3(x)


In the class above, `state_size` = 8 because each observation is a vector of 8 values, and `action_size` = 4 because there are 4 different actions.
*   The layers created in `__init__` connect the input neurons to the hidden layers and the hidden layers to the output.
*   There is no fixed rule for how many neurons a hidden layer should have; 64 was settled on as a value that works well here (8 inputs --> 64 neurons).
*   Two hidden layers of 64 neurons each are created (64 --> 64 neurons).
*   The last hidden layer is connected to the output layer with `action_size` = 4, one Q-value per action (64 neurons --> 4 outputs).
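
As a rough check on the layer sizes listed above, the number of trainable parameters can be counted by hand. This is a plain-Python sketch that does not touch PyTorch; `linear_params` is a helper introduced only for this illustration, and it assumes each fully connected layer has one weight per input/output pair plus one bias per output.

```python
# Count the trainable parameters of a fully connected layer:
# one weight per (input, output) pair plus one bias per output neuron.
def linear_params(n_in, n_out):
    return n_in * n_out + n_out

state_size, hidden_size, action_size = 8, 64, 4  # sizes used in this notebook

layer_sizes = [
    (state_size, hidden_size),   # fc1: 8 -> 64
    (hidden_size, hidden_size),  # fc2: 64 -> 64
    (hidden_size, action_size),  # fc3: 64 -> 4
]

per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_sizes]
total_params = sum(per_layer)
print(per_layer, total_params)  # [576, 4160, 260] 4996
```

So the network is small: just under 5,000 parameters in total, most of them in the 64 --> 64 layer.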





The `forward` method defines how an input is propagated forward through the hidden (intermediate) layers:
1. `x` first holds the result of passing the current input state through `fc1`.
2. That signal goes through the rectifier activation function `relu()` before it is passed to the next layer.
3. The same is done for the second hidden layer, and the output of the final layer `fc3` is returned without an activation.
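
The forward pass described above can be sketched in plain Python on a toy network. Everything here (`relu`, `linear`, `forward`, `toy_layers` and the 2 -> 3 -> 2 sizes) is made up for illustration and is not the notebook's `Network`; it only mirrors the pattern of linear layer, ReLU, linear layer, ReLU-free output.

```python
def relu(values):
    # Rectifier: negative values become 0, positive values pass through.
    return [max(0.0, v) for v in values]

def linear(values, weights, biases):
    # weights holds one row per output neuron.
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]

def forward(state, layers):
    x = state
    for weights, biases in layers[:-1]:
        x = relu(linear(x, weights, biases))  # hidden layers use ReLU
    weights, biases = layers[-1]
    return linear(x, weights, biases)         # output layer has no activation

# Toy weights, chosen by hand for this example.
toy_layers = [
    ([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]),  # 2 -> 3
    ([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]], [0.0, 0.0]),        # 3 -> 2
]

q_values = forward([2.0, 1.0], toy_layers)
print(q_values)  # [2.5, -2.5]
```

The hidden layer produces `[1.0, -1.0, 1.5]` before ReLU and `[1.0, 0.0, 1.5]` after it, which is why the negative neuron contributes nothing to the output.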

## Part 2 - Training the AI

### Setting up the environment

In [None]:
import gymnasium as gym
env = gym.make('LunarLander-v2')
state_shape = env.observation_space.shape
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State Shape: ',state_shape)
print('State Size: ',state_size)
print('Action nos: ',number_actions)

State Shape:  (8,)
State Size:  8
Action nos:  4


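1. Import the `gymnasium` package to create the Lunar Lander environment.
2. `state_shape` is the shape of the observation vector, `(8,)`.
3. `state_size` is the number of values in each observation, i.e. the 8 input parameters at each time step.
4. `number_actions` is the number of discrete actions the agent can choose from (4).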

### Initializing the hyperparameters

In [None]:
learning_rate = 5e-4
minibatch_size = 100
discount_factor = 0.99
replay_buffer_size = int(1e5)
interpolation_parameter = 1e-3



1. `learning_rate` is set to 5e-4 because that value was found to train this agent well (it might differ for other problems).
2. `minibatch_size` is the number of observations used in one step to update the model parameters (normally 100).
3. `discount_factor` sets the present value of future rewards. If it is low, the agent considers only the immediate rewards; if it is high, the agent also considers the rewards obtained from future actions.
4. `replay_buffer_size` is the capacity of the experience replay memory, which gives the agent multiple chances to learn from a given experience. A value of 10^5 means up to 100,000 experiences are kept.
5. `interpolation_parameter` is the weight used in the soft update of the target network: it controls how far the target parameters move toward the local parameters at each update.

Note on the `e` notation for powers of ten:
`1e10` means 10 to the power of 10, so `1e5` is 10^5 and `1e-3` is 10^-3.
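
To make the effect of the discount factor concrete, here is a small sketch under made-up assumptions: a hypothetical episode that pays nothing for 50 steps and then a reward of 100. `discounted_return` is a helper written for this example, not part of the notebook.

```python
def discounted_return(rewards, discount_factor):
    # Sum of rewards, each weighted by discount_factor ** (steps into the future).
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (discount_factor ** step) * reward
    return total

# Hypothetical episode: no reward for 50 steps, then a reward of 100.
rewards = [0.0] * 50 + [100.0]

far_sighted = discounted_return(rewards, 0.99)   # value used in this notebook
short_sighted = discounted_return(rewards, 0.5)  # a much lower value for contrast
print(far_sighted, short_sighted)
```

With 0.99 the distant reward is still worth roughly 60 today, while with 0.5 it is worth practically nothing, so a low discount factor makes the agent ignore a landing bonus that is many steps away.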

### Implementing Experience Replay

In [None]:
class ReplayMemory(object):
  def __init__(self, capacity):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.capacity = capacity
    self.memory = []

  def push(self, event):
    self.memory.append(event)
    if len(self.memory)>self.capacity:
      del self.memory[0]

  def sample(self, batch_size):
    experiences = random.sample(self.memory, k=batch_size)
    states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
    actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
    rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
    next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
    dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None])).to(torch.uint8).float().to(self.device)
    return states,next_states,actions,rewards,dones


This class implements experience replay; the `capacity` parameter is the maximum number of experiences the memory can hold.
1. `device` selects the GPU when one is available and falls back to the CPU otherwise, so the code also runs outside Colab.
2. `memory` is a list of experiences, each holding the state, action, reward, next state and whether the episode is done.

push() method:
1. Adds an experience to the memory buffer and makes sure the capacity is not exceeded.
2. `event` is the experience that is added to the memory.
3. If the memory is full, the element at index 0 is deleted, as it is the oldest experience present in the list.

sample() method:
1. Takes `batch_size` as a parameter: the number of experiences in one batch.
2. `experiences` holds `batch_size` experiences drawn at random from the memory.
3. For each field we loop through the sampled experiences, stack the values in order with `np.vstack`, and convert the resulting NumPy array into a torch tensor on `device`.
4. The `if e is not None` check makes sure we only collect experiences that are available.

Note:
Each field (states, actions, rewards, next states, dones) is stacked into a single torch tensor with one row per experience, and the tensors are returned in the order states, next_states, actions, rewards, dones.
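
The push-and-evict behaviour described above can also be sketched with `collections.deque`, which drops the oldest item automatically once `maxlen` is reached. `DequeReplayMemory` is a hypothetical variant written for this illustration; it is not the notebook's `ReplayMemory`, and it leaves out the conversion to torch tensors.

```python
import random
from collections import deque

class DequeReplayMemory:
    """Hypothetical variant of ReplayMemory that stores raw tuples only."""

    def __init__(self, capacity):
        # With maxlen set, appending to a full deque discards the oldest item.
        self.memory = deque(maxlen=capacity)

    def push(self, event):
        self.memory.append(event)

    def sample(self, batch_size):
        # Copy to a list so random.sample can index the population directly.
        return random.sample(list(self.memory), k=batch_size)

buffer = DequeReplayMemory(capacity=3)
for i in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.push((i, 0, 0.0, i + 1, False))

stored_states = [event[0] for event in buffer.memory]
print(stored_states)  # [2, 3, 4]
batch = buffer.sample(2)
```

After five pushes into a buffer of capacity 3, only the three most recent experiences remain, which is the same outcome as `del self.memory[0]` in the list-based version.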

### Implementing the DQN class

In [None]:
class Agent():
  def __init__(self, state_size, action_size):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.state_size = state_size
    self.action_size = action_size
    self.local_qnetwork = Network(state_size, action_size).to(self.device)
    self.target_qnetwork = Network(state_size, action_size).to(self.device)
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(),lr = learning_rate)
    self.memory = ReplayMemory(replay_buffer_size)
    self.t_step = 0

  def step(self, state, action, reward, next_state, done):
    self.memory.push((state, action, reward, next_state, done))
    self.t_step = (self.t_step+1) % 4
    if self.t_step == 0:
      if len(self.memory.memory)> minibatch_size:
        experiences = self.memory.sample(minibatch_size)
        self.learn(experiences, discount_factor)

  def act(self, state, epsilon=0.):
    state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
    self.local_qnetwork.eval()
    with torch.no_grad():
      action_values = self.local_qnetwork(state)
      self.local_qnetwork.train()
      if random.random()>epsilon:
        return np.argmax(action_values.cpu().data.numpy())
      else:
        return random.choice(np.arange(self.action_size))

  def learn(self, experiences, discount_factor):
    states, next_states, actions, rewards, dones = experiences
    next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
    q_targets = rewards+(discount_factor*next_q_targets*(1-dones))
    q_expected = self.local_qnetwork(states).gather(1, actions)
    loss = F.mse_loss(q_expected, q_targets)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    self.soft_update(self.local_qnetwork, self.target_qnetwork,interpolation_parameter)

  def soft_update(self, local_model, target_model, interpolation_parameter):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
      target_param.data.copy_(interpolation_parameter * local_param.data +(1.0 - interpolation_parameter) * target_param.data)




The `Agent` class ties together the networks, the optimizer and the replay memory.
1. We create two Q-networks: the local Q-network, which selects actions and is trained, and the target Q-network, which provides the targets.
2. The optimizer is an instance of the `Adam` class.
3. It takes the weights of the local network, which it updates so that the lander (the agent) can adapt its actions.
4. The learning rate is also passed to the optimizer.
5. `memory` is the replay memory, initialized with `replay_buffer_size`.
6. `t_step` is a counter that determines at which time steps the weights are updated.

step() method:
Stores an experience and decides when to learn from the stored experiences.
1. Since the `push` method takes a single event, the parameters are packed into a tuple and then passed to it.
2. `t_step` is incremented and wraps back to 0 every four steps, so learning happens once every four steps.
3. We then check whether the number of stored experiences (`self.memory.memory`, the list inside the `ReplayMemory` instance) is greater than `minibatch_size`.
4. If the condition is true, experiences are sampled at random to form a minibatch for the agent to learn from.
5. The `learn` method is defined further down in the class.

act() method:
Selects an action for the given state using an epsilon-greedy action selection policy.
1. `state` and `epsilon` are the parameters used to select an action.
2. `state` is a NumPy array, so we convert it into a torch tensor.
3. The network processes observations in batches (100 at a time during training), so `unsqueeze` adds an extra batch dimension to the single state.
4. The index 0 puts that batch dimension first.
5. We set the local Q-network to evaluation mode and disable gradient tracking with `torch.no_grad()`, because here we only make predictions and do not train.
6. The action values for the state are obtained from the local Q-network, which is then set back to training mode.
7. Following the epsilon-greedy method, with probability 1 - epsilon we pick the action with the highest Q-value, and otherwise we pick a random action; that action moves the agent to the next state.
learn() method:
Updates the agent's Q-values (the network weights) using previous experiences.
1. First unpack the experiences into states, next_states, actions, rewards and dones.
2. `next_q_targets` is the maximum Q-value of each next state according to the target network; `detach()` takes it out of the computation graph so that no gradients flow into the target network.
3. `q_targets` is the target value computed with the Bellman equation: reward + discount_factor * next_q_targets * (1 - done).
4. `q_expected` is the Q-value predicted by the local network for the action that was taken; the loss between `q_expected` and `q_targets` is backpropagated to update the weights.
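
For a single transition, the Bellman target used in `learn` reduces to simple arithmetic. The sketch below uses made-up numbers and a helper named `q_target` that exists only in this example; it follows the same formula as the `q_targets` line in the class.

```python
def q_target(reward, next_q_values, done, discount_factor=0.99):
    # Best achievable value from the next state, according to the target network.
    best_next = max(next_q_values)
    # (1 - done) removes the future term when the episode has ended.
    return reward + discount_factor * best_next * (1 - done)

next_q_values = [0.5, 2.0, -1.0, 0.0]  # made-up target-network outputs

ongoing = q_target(reward=1.0, next_q_values=next_q_values, done=0)
terminal = q_target(reward=-100.0, next_q_values=next_q_values, done=1)
print(ongoing, terminal)
```

In the ongoing case the target is about 1.0 + 0.99 * 2.0 = 2.98, while in the terminal case (for example a crash) the target is just the reward, -100.0, because there is no next state to bootstrap from.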


soft_update() method:
Updates the parameters of the target network.
1. It takes the local network, the target network and the interpolation parameter, and loops over the parameters.
2. The `zip` function pairs each target parameter with the matching local parameter for the loop.
3. The soft update sets each target parameter to a weighted average of the local and target parameters.
4. The formula prevents abrupt changes in the target network, which could destabilize the agent's learning.
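
How slowly the target network follows the local one can be seen on a single made-up parameter. This is a plain-Python sketch; the list-based `soft_update` below is a stand-in written for this example and only reuses the formula from the method above.

```python
def soft_update(target_params, local_params, tau):
    # Weighted average: a small step of size tau toward the local parameters.
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]

tau = 1e-3                    # interpolation_parameter used in this notebook
target, local = [0.0], [1.0]  # one made-up parameter per network

target = soft_update(target, local, tau)
after_one = target[0]

for _ in range(999):
    target = soft_update(target, local, tau)
after_thousand = target[0]
print(after_one, after_thousand)
```

After one update the target has moved only 0.001 of the way toward the local value, and even after 1,000 updates it has covered roughly 63% of the gap (assuming the local value stays fixed), which is what keeps the learning targets stable.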

### Initializing the DQN agent

In [None]:
agent = Agent(state_size, number_actions)

### Training the DQN agent

In [None]:
number_episodes = 2000
maximum_no_timesteps = 1000
epsilon_start_value = 1.0
epsilon_end_value = 0.01
epsilon_decay_value = 0.995
epsilon = epsilon_start_value
scores_on_100_eps = deque(maxlen = 100)

for episode in range(1, number_episodes+1):
  state, _ =env.reset()
  score  = 0
  for t in range(maximum_no_timesteps):
    action = agent.act(state, epsilon)
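    # Gymnasium's step() returns (observation, reward, terminated, truncated, info)
    next_state, reward, terminated, truncated, _ = env.step(action)
    agent.step(state, action, reward, next_state, terminated)
    state = next_state
    score+= reward
    if terminated or truncated:
      break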
  scores_on_100_eps.append(score)
  epsilon = max(epsilon_end_value, epsilon_decay_value * epsilon)
  print('\rEpisode{}\tAverage Score:{:.2f}'.format(episode, np.mean(scores_on_100_eps)), end = "")
  if episode%100==0:
    print('\rEpisode{}\tAverage Score:{:.2f}'.format(episode, np.mean(scores_on_100_eps)))
  if np.mean(scores_on_100_eps)>=200.0:
    print('\nEnv solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_eps)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break






Episode100	Average Score:-106.06
Episode200	Average Score:-25.48
Episode300	Average Score:80.37
Episode400	Average Score:201.30

Env solved in 300 episodes!	Average Score: 201.30


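Training the DQN agent:
1. `number_episodes` is the maximum number of episodes the agent is trained for.
2. `maximum_no_timesteps` caps the number of time steps per episode, to make sure the agent is not stuck in one episode for a long time.
3. The epsilon variables define the epsilon-greedy schedule: epsilon starts at 1.0 and is multiplied by 0.995 after every episode until it reaches 0.01.
4. `scores_on_100_eps` keeps the scores of the last 100 episodes.

Training loop:
1. We loop through all the episodes.
2. `state, _ = env.reset()` resets the environment to its initial state; the underscore discards the extra info that is not needed.
3. `score` accumulates the cumulative reward of the episode.
4. The inner for loop goes through the time steps of each episode.
5. The `act` method of the `Agent` class is called to select an action.
6. The `step` method stores the experience and, every four steps, calls `learn`, which updates the network by backpropagation.

Dynamic print:
The `\r` at the start of the print overwrites the current line, so the average score updates in place; every 100 episodes the line is kept. Training stops once the average score over the last 100 episodes reaches 200, and the weights of the local network are saved to `checkpoint.pth`.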
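
The exploration schedule used in the training loop can be inspected on its own, without running the environment. The sketch below reuses the epsilon values from the training cell; `schedule` and `first_floor_episode` are names introduced only for this illustration.

```python
epsilon_start_value = 1.0
epsilon_end_value = 0.01
epsilon_decay_value = 0.995

epsilon = epsilon_start_value
schedule = []  # schedule[i] is the epsilon used during episode i + 1
for episode in range(1, 2001):
    schedule.append(epsilon)
    epsilon = max(epsilon_end_value, epsilon_decay_value * epsilon)

# First episode that runs with epsilon clamped at its minimum value.
first_floor_episode = next(
    i + 1 for i, value in enumerate(schedule) if value == epsilon_end_value
)
print(schedule[99], schedule[399], first_floor_episode)
```

Epsilon is still around 0.6 at episode 100 and around 0.135 at episode 400, and it only reaches the 0.01 floor after more than 900 episodes, so in the run shown above the agent was still exploring noticeably when the environment was solved.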


## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated  # also stop when the time limit truncates the episode
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'LunarLander-v2')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

