# Homework 8: Q-Learning

In this homework, we will be implementing Q-learning for the Pacman environment.
The inspiration for Pacman trials comes from medical RL research using simulated rat mazes in [*Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3483721/) and [*Memory Alone Does Not Account for the Way Rats Learn a Simple Spatial Alternation Task*](https://www.jneurosci.org/content/40/38/7311/).

As usual, sections where you have to write code are denoted by a `#CODE HERE`. 

## The Environment

We're using an adapted version of the Pacman environment developed right here by Berkeley professors for CS188!
Although we will only code in `qLearningAgents.py`, feel free to explore the environment for more maps, documented implementations, and ways to test our agents. Relevant Python files:

| File | Description |
| :--- | :-- |
| `qLearningAgents.py` | Our Q-learning code, used in `pacman` and `gridworld`. All coding tasks are in this file |
| `pacman.py` | Runs Pacman games |
| `layouts/` | Various Pacman maps |
| `gridworld.py` | Simplified simulation from demo, has different maps |
| `featureExtractor.py` | Functions for converting Pacman game state to a few observables |
| `utils.py`  | Useful data structures, probabilistic methods, and misc functions |

You should have numpy and torch installed from previous assignments.

### Running Code
In a terminal, cd into the folder then run any of these commands. `python pacman.py -h` and `python gridworld.py -h` will explain what each argument does -- feel free to change them up!

The demo I ran in class, using your implementation (good for stepping through Q updates when debugging):
- `python gridworld.py --agent q -g BridgeGrid --manual --noise 0 --learningRate 1.0 --discount 0.9 --episodes 100`

Training and visualizing Q-learning for Pacman:
- `python pacman.py -numTraining 100 -numGames 110 --layout smallGrid -p PacmanQAgent`


## Q-learning (tabular)
Read through `QLearningAgent` in `qLearningAgents.py`. Fill in these functions:

### Policy:
Implement a policy based on the Q values, accessed with `self.getQValue(state,action)`. The best action should maximize our estimated discounted rewards -- the Q value. 
- Fill in `getBestAction`
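
As a rough illustration of the idea (not the assignment's API), a greedy policy over a plain dictionary might look like the sketch below. The names `best_action` and `q_values`, the `(state, action)` key layout, and the default of 0.0 for unseen pairs are all assumptions made for this example.

```python
def best_action(q_values, state, legal_actions):
    """Return the legal action with the highest Q value, or None if there are none."""
    if not legal_actions:
        return None
    # Unseen (state, action) pairs are assumed to have a Q value of 0.0.
    return max(legal_actions, key=lambda action: q_values.get((state, action), 0.0))


# Toy Q table: "south" has never been tried, so it counts as 0.0.
q_values = {("s0", "north"): 0.2, ("s0", "east"): 0.7}
print(best_action(q_values, "s0", ["north", "east", "south"]))
```

Returning `None` when no action is legal is one possible convention for terminal states; check the docstrings in `qLearningAgents.py` for what the real code expects.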

### Exploration vs Exploitation:
During training, always taking the current best action can miss better strategies, so we want to find new ones through random exploration.
- Fill in `getAction`
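
One common way to mix exploration and exploitation is an epsilon-greedy rule: with a small probability take a random legal action, otherwise take the greedy one. The sketch below assumes that rule; `epsilon_greedy`, its `epsilon` parameter, and the dictionary Q table are hypothetical names for this example, not the ones used in `qLearningAgents.py`.

```python
import random


def epsilon_greedy(q_values, state, legal_actions, epsilon, rng=random):
    """Pick a random legal action with probability epsilon, else the greedy one."""
    if not legal_actions:
        return None
    if rng.random() < epsilon:
        # Explore: ignore the Q values entirely.
        return rng.choice(legal_actions)
    # Exploit: unseen (state, action) pairs are assumed to be worth 0.0.
    return max(legal_actions, key=lambda a: q_values.get((state, a), 0.0))


q_values = {("s0", "north"): 0.2, ("s0", "east"): 0.7}
rng = random.Random(0)  # seeded so repeated runs behave the same
picks = [epsilon_greedy(q_values, "s0", ["north", "east"], 0.3, rng) for _ in range(10)]
print(picks)
```

With `epsilon` set to 0 the rule reduces to the pure greedy policy, which is usually what you want once training is over.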

### Bellman Update:
Update the Q function towards the Bellman equation. We want our Q to match the recursive property of the ideal Q function.

$$\text{target} = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1},a_{t+1})$$
- Fill in `update`

$$Q(s_t, a_t) \leftarrow (1-\alpha)Q(s_t, a_t) + \alpha \cdot \text{target}$$
- Fill in `incrementQValue`
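
To sanity-check your numbers by hand, the two equations above can be worked through on a single transition. This is a standalone sketch: `bellman_target` and `blended_update` are hypothetical helper names, and treating a state with no next actions as having a future value of 0.0 is an assumption.

```python
def bellman_target(reward, gamma, next_q_values):
    """target = r + gamma * max over next-state Q values (0.0 if there are none)."""
    return reward + gamma * max(next_q_values, default=0.0)


def blended_update(old_q, target, alpha):
    """Q <- (1 - alpha) * Q + alpha * target."""
    return (1 - alpha) * old_q + alpha * target


# One transition: reward 1, discount 0.9, next-state Q values 0.0 and 2.0.
target = bellman_target(reward=1.0, gamma=0.9, next_q_values=[0.0, 2.0])
new_q = blended_update(old_q=0.5, target=target, alpha=0.5)
print(target, new_q)  # roughly 2.8 and 1.65
```

Setting `alpha` to 1.0, as in the class demo command, makes the new Q value equal to the target, which is convenient when stepping through updates manually.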

### Run It!

Test it out on `gridworld.py` and see if you're getting Q values you expect.
- `python gridworld.py --agent q -g BridgeGrid --manual --noise 0 --learningRate 0.5 --discount 0.9 --episodes 100`

Once that works, it's time for the real test: `pacman.py`. Let's see how your agent fares against ghosts!
- `python pacman.py -numTraining 500 -numGames 505 --layout smallGrid -p PacmanQAgent`


## Deep Q-learning

### Using Features

In reality, rats don't know their x,y coordinates; they can only sense their local environment. Since we shouldn't have access to the state of the world, replace $Q(s_t, a_t)$ with $Q(f_t)$, where $f_t$ is a set of sensible features defined in `featureExtractor.py`:
- Are we about to hit a food? (can be seen)
- How far is the closest food? (can be smelled!)
- Are we adjacent to a ghost? (you can feel the spookiness!)

Read through `FeaturizedQAgent`. Notice how we only had to modify accesses to `self.Q_values`, so the rest of the functions are unchanged!
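
The sketch below illustrates why keying the table by features helps: any two situations that produce the same features share one Q entry. The feature names and the `feature_key` helper are made up for this example and are not taken from `featureExtractor.py`.

```python
from collections import defaultdict


def feature_key(features):
    """Turn a feature dict into a hashable key that ignores insertion order."""
    return tuple(sorted(features.items()))


q_table = defaultdict(float)  # unseen feature combinations start at 0.0

# Two different board positions that happen to look the same to the agent.
seen_in_corner = {"eats_food": 1, "closest_food": 1, "ghost_adjacent": 0}
seen_in_hallway = {"ghost_adjacent": 0, "closest_food": 1, "eats_food": 1}

q_table[feature_key(seen_in_corner)] += 0.5
print(q_table[feature_key(seen_in_hallway)])  # same entry, so also 0.5
```

The table now grows with the number of distinct feature combinations rather than the number of board positions.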

Test out our `FeaturizedQAgent` on `pacman.py`.
- `python pacman.py -numTraining 500 -numGames 505 --layout smallGrid -p FeaturizedQAgent`

### Using Neural Nets

Instead of storing a value of Q for each possible $(s_t, a_t)$, or now each possible $f_t$, let's fit a function! Much like line fitting, once a parameterized function is fit to some datapoints, it can generalize to never-before-seen data!

Our Bellman Update becomes
$$\text{target} = r_t + \gamma \max_{a_{t+1}} Q_\theta(s_{t+1},a_{t+1})$$
$$\text{loss} = \left(Q_\theta(s_t, a_t) - \text{target}\right)^2$$
$$\theta \leftarrow \theta - \alpha \nabla_\theta \text{loss}$$

The only difference is how we increment our Q values!
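
To make the gradient step concrete without torch, here is a minimal sketch that assumes $Q_\theta$ is a plain linear function of a feature vector (no bias) and treats the target as a constant when differentiating. The function names and toy numbers are hypothetical; `DeepQAgent` itself uses torch and may differ in detail.

```python
def q_value(weights, features):
    """Linear model: Q_theta(f) = sum_i theta_i * f_i."""
    return sum(w * f for w, f in zip(weights, features))


def gradient_step(weights, features, target, alpha):
    """One step of theta <- theta - alpha * grad of (Q_theta(f) - target)^2."""
    error = q_value(weights, features) - target
    # d(loss)/d(theta_i) = 2 * error * f_i, with the target held fixed.
    return [w - alpha * 2.0 * error * f for w, f in zip(weights, features)]


weights = [0.0, 0.0]
features = [1.0, 0.5]
for _ in range(50):
    weights = gradient_step(weights, features, target=2.0, alpha=0.1)

# After repeated steps the prediction should sit very close to the target.
print(round(q_value(weights, features), 3))
```

Unlike the tabular update, each step changes the prediction for every input that shares these features, which is where the generalization comes from.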

Read through `DeepQAgent`. This 1-layer neural network is implemented exactly like you programmed COVID models in Assignment 2! Notice we are using mean squared error (MSE) loss.

Test out our `DeepQAgent` on `pacman.py`. Instead of learning a separate value for each possible combination of features, it should hopefully learn faster by generalizing!
- `python pacman.py -numTraining 300 -numGames 305 --layout smallGrid -p DeepQAgent`

Now let's see it on a real map!
- `python pacman.py -numTraining 300 -numGames 305 --layout mediumClassic -p DeepQAgent`


## Homework Submission

Nice job implementing Q-learning! Please upload `qLearningAgents.py` [here](https://forms.gle/AMWsG57KxHU3Zd7v7), and feel free to include any cool Pacman videos.