# Lunar Landing using Reinforcement Learning

*by Vishal Shetty, November 5, 2021* 

## Introduction

Summarize the machine learning concepts, algorithms, and data that interest you.  Describe why you are interested in these:

1: Reinforcement learning piqued my interest for its applications in control theory, particularly for control under uncertain conditions.

2: Recent strides in reinforcement learning make it relatively easy to apply learning-based control strategies to complex systems, where humans cannot see the underlying patterns that lead to an optimal solution.

Very briefly describe your planned methods:

1: Use the environments provided by OpenAI Gym to train and test the Temporal Difference TD(0) with Q-learning algorithm implemented in class.

## Methods

Describe in some detail the algorithms and data you will use:

1: The LunarLander-v2 environment from OpenAI Gym will be used to train and test our reinforcement learning agent.

2: The cells below describe the details of the environment: states, actions, and rewards.

In [33]:
## Citation [1]
import gym

env = gym.make('LunarLander-v2')
states = env.observation_space   # the observations provided by the environment
actions = env.action_space       # the action space the agent selects its actions from
print(actions, states, states.shape)

Discrete(4) Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32) (8,)


### Action Space:   

Discrete (Discrete Action Space with 4 values):

0 = Do Nothing

1 = Fire Left Engine

2 = Fire Main Engine

3 = Fire Right Engine [2]

### Observation Space:
The observation space for the Lunar Lander is a "Box" containing 8 values, each in the range [ $-\infty$, $\infty$ ]. These values are:

Position X

Position Y

Velocity X

Velocity Y

Angle

Angular Velocity

Is left leg touching the ground: 0 OR 1

Is right leg touching the ground: 0 OR 1 [2]
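
The eight values arrive as an unlabeled array, so it can help to attach names to them while debugging. The sketch below assumes the ordering listed above; `LanderState`, `label_observation`, and the field names are hypothetical helpers for illustration, not part of the gym API, and the sample observation is made up.

```python
from collections import namedtuple

# Hypothetical field names for illustration; the environment itself only
# returns an unlabeled array of 8 floats.
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity",
     "left_leg_contact", "right_leg_contact"],
)


def label_observation(observation):
    """Wrap an 8-value observation in a named tuple, in the order listed above."""
    values = [float(v) for v in observation]
    if len(values) != 8:
        raise ValueError("expected 8 observation values, got %d" % len(values))
    return LanderState(*values)


# Made-up observation for illustration (not taken from a real episode).
example = label_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0])
print(example.y, example.left_leg_contact)
```

Named fields make reward-shaping or logging code easier to read than indexing `observation[3]` directly.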

### Reward System:

Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite.[3]

### Below is a test run of the lunar lander without any training based on random actions

In [34]:
## Citation [1]
## To run the code below, install the following packages:
##   pip install gym
##   conda install swig
##   pip install box2d
##   pip install pyglet
import gym

env = gym.make('LunarLander-v2')
for i_episode in range(11):
    observation = env.reset()  # state initialized to a random starting point
    done = False               # becomes True once a terminal state is reached
    score = 0                  # accumulated reward for this episode
    while not done:
        env.render()
        # Random action selection, without any training.
        action = env.action_space.sample()
        # Step the environment: next state, reward, terminal flag, extra info.
        observation, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(i_episode, score))
env.close()

Episode:0 Score:-382.76200053806633
Episode:1 Score:-64.8373431659697
Episode:2 Score:-137.30806286788524
Episode:3 Score:-59.221472281021676
Episode:4 Score:-186.10538350627127
Episode:5 Score:-114.52278555700988
Episode:6 Score:-113.16441953486438
Episode:7 Score:-351.54791324452214
Episode:8 Score:-244.82510361301874
Episode:9 Score:-120.8438457367883
Episode:10 Score:-139.68522829057528


We can see that the rewards accumulate to a negative score in all 11 episodes (0 through 10) with random action selection.
Our goal with the RL agent implementation is to achieve a high positive accumulated reward.
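
To make the random baseline easier to compare against later runs, the printed scores can be summarized in a few numbers. This is a small sketch: the scores are copied from the output above and rounded to two decimals, and the variable names are chosen here for illustration.

```python
from statistics import mean

# Scores from the random-action run above, rounded to two decimals.
random_scores = [
    -382.76, -64.84, -137.31, -59.22, -186.11, -114.52,
    -113.16, -351.55, -244.83, -120.84, -139.69,
]

baseline_mean = mean(random_scores)
best, worst = max(random_scores), min(random_scores)
print('episodes:', len(random_scores))
print('mean: {:.2f}  best: {:.2f}  worst: {:.2f}'.format(baseline_mean, best, worst))
```

A trained agent should push the mean well above this baseline, toward the 200 points that count as solved.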

### Algorithms to be employed:
#### Temporal Difference-TD(0) :
Temporal difference learning can be viewed as a chain of predictions. Rather than calculating the total future reward, TD tries to predict the combination of the immediate reward and its own reward prediction at the next moment in time. Then, when the next moment comes, bearing new information, the new prediction is compared against what it was expected to be. If they differ, the algorithm calculates how different they are and uses this "temporal difference" to adjust the old prediction toward the new prediction. [4]

Mathematically:

$\text{value}(s_t, a_t) \approx r_{t+1} + \text{value}(s_{t+1}, a_{t+1})$ [5]
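
As a rough illustration of this update, the sketch below keeps the predictions in a plain dictionary rather than in the neural network used in class. The function name `td0_update`, the `learning_rate` parameter, and the toy states are assumptions made for this example. No discount factor is applied, matching the formula above.

```python
from collections import defaultdict


def td0_update(value, state, action, reward, next_state, next_action,
               done, learning_rate=0.1):
    """Move value[(state, action)] a fraction of the way toward the TD target."""
    # Target: immediate reward plus the current prediction for the next step.
    # At a terminal step there is no next prediction, so the target is the reward.
    target = reward if done else reward + value[(next_state, next_action)]
    td_error = target - value[(state, action)]  # the "temporal difference"
    value[(state, action)] += learning_rate * td_error
    return td_error


value = defaultdict(float)   # all predictions start at 0.0
value[("s1", 2)] = 5.0       # pretend the next step already has a prediction
err = td0_update(value, "s0", 0, reward=1.0, next_state="s1",
                 next_action=2, done=False, learning_rate=0.5)
print(err, value[("s0", 0)])
```

The old prediction moves only part of the way toward the target, so repeated visits gradually pull it toward a consistent estimate.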

#### Q-Learning :
The objective for any reinforcement learning problem is to find the sequence of actions that maximizes (or minimizes) the sum of reinforcements along the sequence. This is reduced to the objective of acquiring the Q function that predicts the expected sum of future reinforcements, because the correct Q function determines the optimal next action. [ 6 ]

Mathematically : 

$Q(s_t, a_t) \approx \sum_{k=0}^{\infty} r_{t+k+1}$ [6]
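
For a finite episode, the sum that Q is meant to predict can be computed directly once the episode is over. The sketch below does that for a made-up reward sequence; `returns_to_go` is a hypothetical helper, and `rewards[t]` is assumed to hold $r_{t+1}$, the reinforcement received after acting at step $t$.

```python
def returns_to_go(rewards):
    """Return a list where entry t is the undiscounted sum of rewards[t:]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each entry reuses the sum already computed after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns


# Made-up rewards: two main-engine frames, two leg contacts, a final landing bonus.
episode_rewards = [-0.3, -0.3, 10.0, 10.0, 100.0]
print(returns_to_go(episode_rewards))
```

Each entry equals its own reward plus the next entry, which is the same relationship the TD(0) formula exploits one step at a time.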

#### Policy : Epsilon-greedy :
Epsilon (ϵ) indicates the probability of taking a random action. Its value is from 0 to 1. Given a Qnet, the current state, the set of valid_actions and epsilon, we can define a function that either returns a random choice from valid_actions or the best action determined by the values of Q produced by Qnet for the current state and all valid_actions. This is referred to as an ϵ-greedy policy. [ 6 ]
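
A minimal sketch of such a function is shown below. It is not the class implementation: `q_function` is a plain callable standing in for Qnet, `toy_q` is a made-up example, and the `rng` parameter is an assumption added here so the random choices can be seeded.

```python
import random


def epsilon_greedy(q_function, state, valid_actions, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(valid_actions)
    # Greedy: highest predicted Q value. Use min instead if reinforcements are costs.
    return max(valid_actions, key=lambda action: q_function(state, action))


def toy_q(state, action):
    # Hypothetical stand-in for Qnet: pretends action 2 (main engine) is best.
    return 1.0 if action == 2 else 0.0


valid_actions = [0, 1, 2, 3]
print(epsilon_greedy(toy_q, None, valid_actions, epsilon=0.0))
```

A common pattern is to start with epsilon near 1 so the agent explores, then decay it toward 0 as the Q estimates improve.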

#### In your proposal, make a table here with at least 5 milestones for your project with expected dates: 

1: 6th November 2021 to 10th November 2021 : Setting up the OpenAI Gym environment.

2: 11th November 2021 to 20th November 2021 : Implementation of the RL agent on the OpenAI Gym environment.

3: 21st November 2021 to 25th November 2021 : Tuning hyperparameters for optimal performance.

4: 26th November 2021 to 5th December 2021 : Analysis based on the performance of the RL agent.

5: 6th December 2021 to 10th December 2021 : Presentation of results and demo.

## Results

In the proposal, summarize what you expect your results to be:

1: Because fuel is infinite, I expect the RL agent to learn to hover or fly first. [2]

2: Achieving a positive score (that is, a positive accumulated reward) for each episode.

## Conclusions

In your proposal, describe what you expect to learn:

1: Behavior of RL agent under different hyperparameters

What you expect will be most difficult:

1: Implementing the RL agent on the OpenAI Gym environment.


### References

* [1] OpenAI Gym environment documentation: https://gym.openai.com/docs/#environments
* [2] OpenAI Gym environment, LunarLander, GitHub source: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py
* [3] OpenAI Gym environment, LunarLander-v2: https://gym.openai.com/envs/LunarLander-v2/
* [4] Temporal Difference: https://deepmind.com/blog/article/Dopamine-and-temporal-difference-learning-A-fruitful-relationship-between-neuroscience-and-AI
* [5] Lecture 17, CS545: https://nbviewer.org/url/www.cs.colostate.edu/~anderson/cs545/notebooks/15%20Introduction%20to%20Reinforcement%20Learning.ipynb
* [6] Lecture 18, CS545: https://nbviewer.org/url/www.cs.colostate.edu/~anderson/cs545/notebooks/16%20Reinforcement%20Learning%20with%20Neural%20Network%20as%20Q%20Function.ipynb


In [32]:
import io
import nbformat
import glob
nbfile = glob.glob('Shetty-ProjectProposal.ipynb')
if len(nbfile) > 1:
    print('More than one ipynb file. Using the first one.  nbfile=', nbfile)
with io.open(nbfile[0], 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, nbformat.NO_CONVERT)
word_count = 0
for cell in nb.cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print('Word count for file', nbfile[0], 'is', word_count)

Word count for file Shetty-ProjectProposal.ipynb is 822
