# Lunar Lander Reinforcement Learning Agent
### Using OpenAI Gym
#### Charlie Bailey (peba2926)

## Short Overview

The goal of this project is to create an intelligent agent that can learn and solve the [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) Environment from OpenAI Gym. In this environment, a lunar lander agent seeks to solve a classic rocket trajectory optimization problem. The agent's goal is to safely land the vehicle between two yellow flags on a rocky moon surface.

![Lunar Lander Demo GIF](../assets/lunar_lander_demo.gif)

### State, Action and Reward Descriptions
The state space of this environment is continuous and represented as an 8-dimensional vector that describes the lander's x & y coordinates, its x & y linear velocities, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground.

To navigate this environment, the agent has 4 discrete action choices it can make: 
* 0: do nothing
* 1: fire left orientation engine
* 2: fire main engine
* 3: fire right orientation engine

As the agent interacts with the environment, a reward is granted at each step. For each step, the reward:
* is increased/decreased the closer/further the lander is to the landing pad.
* is increased/decreased the slower/faster the lander is moving.
* is decreased the more the lander is tilted (angle not horizontal).
* is increased by 10 points for each leg that is in contact with the ground.
* is decreased by 0.03 points each frame a side engine is firing.
* is decreased by 0.3 points each frame the main engine is firing.

The episode also gets an additional reward of -100 for crashing the lander, or +100 for landing safely.

Citation: Farama Foundation. (n.d.). Lunar Lander.

### Implementation
Since this is a continuous state space with discrete actions, my goal for this project is to implement an Approximate Q-learning agent and then extend this implementation utilizing a Deep Q-network (DQN). In order to maximize my learning and understanding, I plan to hand code the agent's Approximate Q-learning algorithms following the structure of the CS188 PacMan intelligent agent that we implemented in HW4A. Following this, I will attempt to integrate a DQN into my agent to compare architecture results. 

## Approach

All of the code for this project can be found in the project [GitHub repo](https://github.com/charliebailey24/RL-agent).

Following the example laid out in the CS188 PacMan implementation we built in HW4A, I wanted to keep a clean, modular structure for this project with good separation of concerns.

### Initialize the Environment
To get started, I first wanted to get familiar with the environment and understand how the lunar lander operated. I started by creating a simple random agent in the `randomAgent.py` file and ran it for a few episodes in human render mode so I could see the output. I also performed some logging to better understand the state and action space that I would be working with. The boilerplate code for this agent's implementation was derived from the `cartpole-updated.ipynb` provided in class.

### Integration with Approximate Q-learning Agent
After I had developed a better understanding of the Lunar Lander environment, I set out to adapt the Approximate Q-learning Agent architecture to this new environment. The code for this implementation can be found in the `approxQLearningAgent.py` file.

Because there was a lot of inheritance in the original PacMan implementation, I decided to rewrite all of this code from scratch to better understand the architecture and detailed flow of the agent's `update` logic. Overall, the final result ended up being very similar to the implementation I built in the original assignment, but it was a useful learning exercise to further elucidate how the agent's learning update works in detail.

At a high level, an Approximate Q-learning agent learns weights for features extracted from state-action pairs in a given environment. From the PacMan assignment, we know that the approximate Q-function takes the following form:

$$
Q(s,a) \;=\; \sum_{i=1}^n f_i(s,a)\,w_i
$$

where each weight $w_i$ is associated with a particular feature $f_i(s,a)$. After each transition, the weights are updated using the temporal-difference error derived from the Bellman equation:

$$
w_i \leftarrow w_i + \alpha \,\cdot\, \mathrm{difference}\,\cdot\,f_i(s,a)
$$

with

$$
\mathrm{difference}
\;=\;\bigl(r + \gamma\,\max_{a'} Q(s',a')\bigr)\;-\;Q(s,a)\,. 
$$
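
The two equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `approxQLearningAgent.py`: the function names, the dictionary-based feature representation, and the `terminal` flag (which drops the bootstrap term at the end of an episode) are all assumptions made for the example.

```python
from typing import Dict, Sequence

Features = Dict[str, float]  # hypothetical representation: feature name -> value


def q_value(weights: Dict[str, float], features: Features) -> float:
    """Q(s, a) = sum_i f_i(s, a) * w_i; unseen weights default to 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def update_weights(
    weights: Dict[str, float],
    features: Features,
    reward: float,
    next_q_values: Sequence[float],
    alpha: float,
    gamma: float,
    terminal: bool = False,
) -> float:
    """Apply w_i <- w_i + alpha * difference * f_i(s, a) in place.

    next_q_values holds Q(s', a') for every action a' in the next state.
    Returns the temporal-difference error ("difference").
    """
    # Assumption: no bootstrapping from a terminal state.
    max_next_q = 0.0 if terminal or not next_q_values else max(next_q_values)
    # Compute the difference before touching any weight.
    difference = (reward + gamma * max_next_q) - q_value(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return difference


weights: Dict[str, float] = {}
td_error = update_weights(
    weights, {"bias": 1.0, "x": 0.5}, reward=1.0,
    next_q_values=[0.0, 2.0], alpha=0.1, gamma=0.9,
)
```

Note that each weight moves in proportion to its own feature value, which is why unscaled features combined with a large `alpha` can make the weights grow very quickly.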

Since the original PacMan assignment provided predefined feature extractors, one of the most challenging tasks in adapting this approach to the Lunar Lander environment was designing a suitable feature extractor from scratch. Understanding precisely which features to use and how to represent them meaningfully for integration into the `ApproxQAgent` class required quite a bit of time studying the original PacMan intelligent agent code base. The implementation can be seen in `lunarLanderEnv.py`.

Once I had the agent learning logic and was able to extract features from the environment, the next step was to put everything together and actually train the agent in the Lunar Lander environment. Adapting again from the `cartpole-updated.ipynb`, I implemented a training loop in `runAgent.py`. This main program resets the environment, repeatedly selects actions based on the epsilon-greedy policy, and updates the Q-function weights using the update rule described above. Training continues until the specified number of episodes is completed, at which point the agent’s performance is analyzed.
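
For reference, epsilon-greedy action selection can be sketched as follows. This is an illustrative version only: the function name, the `q_of` callback, and the random tie-breaking are assumptions for the example and are not taken from `runAgent.py`.

```python
import random
from typing import Callable, Optional, Sequence


def epsilon_greedy(
    actions: Sequence[int],
    q_of: Callable[[int], float],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a random action with probability epsilon, else a greedy one.

    q_of(action) is a hypothetical callback returning Q(s, action)
    for the current state.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(actions))  # explore
    q_values = {action: q_of(action) for action in actions}
    best = max(q_values.values())
    # Break ties randomly so the agent does not always favor action 0.
    best_actions = [a for a, q in q_values.items() if q == best]
    return rng.choice(best_actions)  # exploit


# The four Lunar Lander actions with made-up Q-values.
example_q = {0: 0.1, 1: 0.5, 2: -1.0, 3: 0.2}
greedy_action = epsilon_greedy([0, 1, 2, 3], example_q.__getitem__, epsilon=0.0)
```

With `epsilon=0.0` the call above always returns the action with the highest Q-value; with `epsilon=1.0` every action is equally likely.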

With all of the component pieces of my RL system in place, it was time to start training the agent in the Lunar Lander environment.

## Results

### Training the Agent
The next step in my project was to train the agent and evaluate the performance. This portion of the project ended up being the most difficult. I started by implementing a `runLunarLander()` function that allowed me to manually determine the following hyperparameters:
* NUM_TRAINING_EPISODES = how many episodes to train the agent for
* ALPHA = learning rate
* EPSILON = exploration rate (epsilon-greedy implementation)
* GAMMA = discount factor

To start, I simply utilized the values from the default PacMan settings:
* ALPHA = 1.0
* EPSILON = 0.5
* GAMMA = 0.8

I then incrementally tested an increasing number of training episodes to get a baseline for my agent. I quickly hit an `overflow encountered` error caused by excessively large weights. After some debugging and research on hyperparameter tuning in the [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html) RL training tips docs, I concluded that this was likely due to the ALPHA value of 1.0, which was far too high. I also realized that I should be normalizing the state values to be between -1 and 1. The Stable Baselines3 documentation recommended looking at the [RL Zoo](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/dqn.yml) GitHub project to find tuned hyperparameters. Although the RL Zoo repo did not provide hyperparameters specifically tuned for Approximate Q-learning, the provided DQN parameters offered a reasonable starting point due to the similarity of the underlying Q-learning mechanics.
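
As an illustration of the normalization step, here is a minimal min-max scaling sketch that maps each state component into [-1, 1]. The function name and the bounds in the usage example are hypothetical; the real bounds would come from the environment's observation space, and the normalization in the project code may be implemented differently.

```python
from typing import List, Sequence


def normalize_state(
    state: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
) -> List[float]:
    """Linearly rescale each component from [low, high] to [-1, 1].

    Values outside the bounds are clipped first so a single outlier
    cannot produce a very large feature value.
    """
    scaled = []
    for value, lo, hi in zip(state, low, high):
        clipped = min(max(value, lo), hi)
        scaled.append(2.0 * (clipped - lo) / (hi - lo) - 1.0)
    return scaled


# Hypothetical bounds for a 3-component state, for illustration only.
normalized = normalize_state([0.0, 5.0, -10.0], low=[-1.0, 0.0, -5.0], high=[1.0, 10.0, 5.0])
```

Keeping every feature in a bounded range limits the size of each weight update, since the update is proportional to the feature value.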

Finally, following the advice of Sutton and Barto (p. 132):

`Of course, if ε were gradually reduced, then both methods would asymptotically converge to the optimal policy.`

I added exponential epsilon decay, implemented using a simple decay function designed to smoothly reduce epsilon from its initial value (EPSILON = 1.0) to a minimum (EPSILON_MIN = 0.01) over approximately 12% (DECAY_FRACTION = 0.12) of the total episodes.
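
One way such a schedule could look is sketched below. The exact decay function used in the project is not shown here, so the formula (geometric interpolation from the initial value to the minimum over the first `decay_fraction` of episodes) and the function name are assumptions for the example.

```python
def decayed_epsilon(
    episode: int,
    total_episodes: int,
    eps_start: float = 1.0,
    eps_min: float = 0.01,
    decay_fraction: float = 0.12,
) -> float:
    """Exponentially decay epsilon from eps_start to eps_min.

    The decay spans the first decay_fraction of training; after that,
    epsilon stays at eps_min.
    """
    decay_episodes = max(1, round(total_episodes * decay_fraction))
    progress = min(episode / decay_episodes, 1.0)
    # Geometric interpolation: eps_start at progress 0, eps_min at progress 1.
    return eps_start * (eps_min / eps_start) ** progress


# Epsilon at a few points in a 1000-episode run.
schedule = [decayed_epsilon(ep, 1000) for ep in (0, 60, 120, 500)]
```

Under these assumptions, a 1000-episode run reaches the minimum after 120 episodes and is halfway there on a log scale (epsilon of about 0.1) after 60.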

With these fixes, I retrained my agent in the Lunar Lander environment with the following hyperparameters:
* NUM_TRAINING_EPISODES = 1000
* ALPHA = 0.001
* EPSILON = 1.0
* GAMMA = 0.99
* EPSILON_MIN = 0.01
* DECAY_FRACTION = 0.12

Here are the results from `Lunar Lander run 1` with the hyperparameters above:

![Lunar Lander 1k Episodes with a=0.001 e=1 g=0.99 em=0.01 df=0.12](../assets/lunarLander1k_a=0.001_e=1_g=0.99_em=0.01_df=0.12.png)

As we can see, the run was successful (no overflow errors), and it appears that some learning is taking place; however, the agent plateaus at a reward of -120.

From here, my hypothesis was that I was not training the agent for enough episodes to allow sufficient learning. For my next experiment, I upped the training to 10k episodes.

Here are the results of `Lunar Lander run 2` with all of the same hyperparameters as `run 1` except the number of episodes has been increased to 10k:

![Lunar Lander 10k Episodes with a=0.001 e=1 g=0.99 em=0.01 df=0.12](../assets/lunarLander10k_a=0.001_e=1.0_g=0.99_em=0.01_df=0.12.png)

As we can see from the graph, while the agent appeared to be learning at 1k episodes, once we scale up to 10k episodes the 100-episode trailing reward is flat and even dips significantly at points in the training run.
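
For readers unfamiliar with the metric, a 100-episode trailing reward is the mean of the most recent 100 episode rewards (or fewer, early in training). A minimal sketch, with a hypothetical function name and not the plotting code used for the figures:

```python
from collections import deque
from typing import Iterable, List


def trailing_mean(rewards: Iterable[float], window: int = 100) -> List[float]:
    """Return the mean of the last `window` rewards after each episode."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    means = []
    for reward in rewards:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is about to be evicted
        recent.append(reward)
        running_sum += reward
        means.append(running_sum / len(recent))
    return means


smoothed = trailing_mean([1.0, 2.0, 3.0, 4.0], window=2)
```

Smoothing this way hides the large episode-to-episode variance, which makes plateaus and dips like the ones in the graph easier to see.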

I tried several other variations of the hyperparameters with this number of episodes and all of the runs turned out similar to this one. At this point, I was worried that there might be an issue with my Approximate Q-learning logic, so I decided to test my agent on the Cart Pole environment to determine whether the lack of learning was due to the complexity of the environment, or a bug in my learning logic. Cart Pole’s lower complexity makes it a suitable debugging environment because rapid learning improvements are expected, thus clearly indicating whether my implementation logic is sound.

### Debugging the Agent
In order to determine whether there was a bug in the Approximate Q-learning logic, or if the environment was simply too complex for Approximate Q-learning, I decided to test the agent in the `Cart Pole` environment. Luckily, due to the modular nature of my code base, this ended up being not too difficult to spin up (albeit with a lot of duplicate code to keep track of).

Once I had the feature extractor (with normalization) set up for the Cart Pole environment, I simply created a new `runCartPole()` function. I then revisited the RL Zoo hyperparameter recommendations for the Cart Pole environment. For this experiment, I used the following hyperparameters:

* NUM_TRAINING_EPISODES = 20_000
* ALPHA = 0.0023
* EPSILON = 1.0
* GAMMA = 0.99
* EPSILON_MIN = 0.04
* DECAY_FRACTION = 0.16

Here are the results of the `Cart Pole run 1` experiment:

![Cart Pole 20k Episodes with a=0.0023 e=1 g=0.99 em=0.04 df=0.16](../assets/cartPole_20k_1=0.0023_em=0.04_df=0.16.png)

As we can see, we are still getting similar results to the Lunar Lander environment: a little bit of learning early on and then nearly flat average reward from then on.

After doing a bunch of additional logging and debugging to further check whether my `update` function was working correctly, I began researching whether it was even possible to solve these environments using an approximate Q-learning implementation, and how many episodes I should expect to run in order to do so.

After looking at several different resources and implementations on GitHub, I realized that the simple feature extractor I had created likely wasn't capturing the non-linear interactions **between** features that are necessary for the agent to learn in continuous state spaces like Lunar Lander or Cart Pole. To test this hypothesis, I decided to implement a very lightweight Radial Basis Function feature extractor to approximate the non-linear interactions between features.

Quick aside: A Radial Basis Function (RBF) measures similarity between vectors based on their distance from a central point. In the common Gaussian form, the similarity decays exponentially with the squared distance from that center. Because of this property, a set of RBFs can represent non-linear relationships over the underlying state vector while the Q-function stays linear in its weights, making them useful for feature representation in approximate Q-learning with continuous state spaces.
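
To make the idea concrete, here is a minimal Gaussian RBF feature sketch in plain Python. It uses explicit, hand-picked centers and a single width `sigma`, both of which are assumptions for illustration; the project's extractor may instead rely on a library approximation such as the scikit-learn `RBFSampler` listed in the references.

```python
import math
from typing import List, Sequence


def rbf_features(
    state: Sequence[float],
    centers: Sequence[Sequence[float]],
    sigma: float = 1.0,
) -> List[float]:
    """One feature per center: exp(-||state - center||^2 / (2 * sigma^2)).

    Each feature is 1.0 when the state sits on its center and falls
    toward 0.0 as the state moves away.
    """
    features = []
    for center in centers:
        squared_distance = sum((s - c) ** 2 for s, c in zip(state, center))
        features.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    return features


# Two hypothetical centers in a 2-dimensional normalized state space.
example = rbf_features([0.0, 0.0], centers=[[0.0, 0.0], [1.0, 0.0]])
```

Because each feature depends on all state components at once, a linear combination of these features can capture interactions that a linear function of the raw state cannot.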

With the RBF feature extractor implemented, I then ran another experiment to determine whether the underlying problem lay in the state space representation or approximate Q-learning logic.

For this experiment, I used the following hyperparameters:

* NUM_TRAINING_EPISODES = 3_000
* ALPHA = 0.0023
* EPSILON = 1.0
* GAMMA = 0.99
* EPSILON_MIN = 0.04
* DECAY_FRACTION = 0.16

Here are the results of the `Cart Pole run 2` experiment with the RBF feature extractor:

![Cart Pole RBF 3k Episodes with a=0.0023 e=1 g=0.99 em=0.04 df=0.16](../assets/cartPole_RBF_3k_a=0.0023_em=0.04_df=0.16.png)

As we can see, with the addition of an RBF feature extractor, the agent is able to learn in the Cart Pole environment! This suggests that the Approximate Q-learning logic is working correctly and that the earlier lack of learning came from the state representation.

Next, we can go back to the Lunar Lander environment, implement the RBF feature extractor there, and see how well the agent learns.

### Retraining the Agent

After demonstrating learning with the RBF feature extractor in the Cart Pole environment, I moved back to the Lunar Lander environment to test the new implementation. Unfortunately, the complexity of the environment led to prohibitively long training times. I was only able to train for 500 episodes (which took about 30 minutes), and the result was even worse performance than my initial agent.

Here are the results from the RBF agent implementation:

![Lunar Lander RBF 500](../assets/lunarLander_RBF_500.png)

These results are less than ideal. We do see a steady tick up in the average reward, which indicates that the agent may be able to learn further with additional training time.

## Conclusion

Looking at the project I have completed thus far, I am very happy with the amount of work I put in to build an Approximate Q-learning agent from scratch with minimal reliance on external libraries. While the results may not have been ideal, I believe that with more time, I would be able to extend this implementation with a Deep Q-network that would allow it to learn an optimal policy to solve the Lunar Lander environment.

In reflecting on what I would have done differently, I regret going down the route of implementing the RBF feature extractor and wish I had jumped straight to the DQN. Because it only utilizes the CPU, my training times using the RBF feature extractor were prohibitively long (nearly 30 minutes for 500 episodes), which did not allow me enough time to properly tune and test different hyperparameters. I went down this implementation route initially because I thought it would be easier to iterate and improve upon compared to the DQN, but in hindsight that assumption was incorrect.

One important takeaway is that the Approximate Q-learning logic we implemented in HW4A does appear to be correct, in that the agent is able to learn in the simpler Cart Pole environment. I am also very happy with the modular separation of concerns that I implemented in my overall program layout. With this initial groundwork laid, I believe future improvements on this project will be vastly simplified.

I also regret that I don't have a working agent GIF to present, since I did not have enough time to properly train the agent using the RBF strategy.

Just for reference, here is a GIF of `Lunar Lander run 1` agent:

![Best Lunar Lander Agent](../assets/LunarLandingAgent.gif)

From this GIF, it appears that the agent's learned policy was to simply do nothing!

## References

1. Farama Foundation. (n.d.). Lunar Lander [Computer-software documentation]. Gymnasium. Retrieved April 23, 2025, from https://gymnasium.farama.org/environments/box2d/lunar_lander/

2. Stable Baselines3. (n.d.). Reinforcement-learning tips & tricks [Documentation]. Retrieved April 28, 2025, from https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html

3. DLR-RM. (n.d.). dqn.yml [Source code]. GitHub. Retrieved April 28, 2025, from https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/dqn.yml

4. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed., Adaptive Computation and Machine Learning series) [Kindle version]. MIT Press.

5. sy2002. (n.d.). rl-q-rbf-cartpole.py [Source code]. GitHub. Retrieved April 28, 2025, from https://github.com/sy2002/ai-playground/blob/master/rl-q-rbf-cartpole.py

6. Ogunfool. (n.d.). ApproximateRL_CartPole [Source code]. GitHub. Retrieved April 28, 2025, from https://github.com/Ogunfool/ApproximateRL_CartPole

7. scikit-learn. (n.d.). sklearn.preprocessing.StandardScaler [Documentation]. Retrieved April 28, 2025, from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

8. scikit-learn. (n.d.). sklearn.kernel_approximation.RBFSampler [Documentation]. Retrieved April 28, 2025, from https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.RBFSampler.html

9. Gibber, B. (n.d.). Function approximation. RL notes. Retrieved April 28, 2025, from https://gibberblot.github.io/rl-notes/single-agent/function-approximation.html

10. GeeksforGeeks. (n.d.). Function approximation in reinforcement learning. Retrieved April 28, 2025, from https://www.geeksforgeeks.org/function-approximation-in-reinforcement-learning/