# Beginner's Guide to Reinforcement Learning

## OpenAI Gym

We will use the OpenAI Gym toolkit to explore some aspects of *reinforcement learning* and to program our first *learning agent*. Gym is a toolkit for developing and comparing reinforcement learning algorithms. You can learn more about it here: https://gym.openai.com/docs/

## Prerequisites
- some basic knowledge of Python
- a Python 2.7 or Python 3 installation (e.g., via Anaconda: https://www.anaconda.com/download/)
- Jupyter Notebook (comes with Anaconda)
- `pip` (to install dependencies)

## Installation

- you need to install **OpenAI Gym** (install via PyPI recommended)  
    https://gym.openai.com/docs/#installation    

```sh
pip install gym
```

## The Frozen Lake environment

![](frozenlake.png) <div style="text-align: center"> photo credit: Shea Gunther </div>

Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you'll fall into the freezing water.
At this time, there's an international frisbee shortage, so it's absolutely imperative that
you navigate across the lake and retrieve the disc.
~~However, the ice is slippery, so you won't always move in the direction you intend.~~
The surface is described using a grid like the following:

```
SFFF
FHFH
FFFH
HFFG
```

```
S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal, where the frisbee is located
```

The episode ends when you reach the goal or fall in a hole.
You receive a reward of 1 if you reach the goal, and zero otherwise.
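
To make the grid easier to work with in code, here is a small sketch that maps grid cells to state numbers. It assumes the row-major numbering that Gym's FrozenLake uses (state 0 is the top-left cell, state 15 the bottom-right one). The helper names `to_state`, `to_cell` and `tile_at` are made up for this illustration and are not part of Gym.

```python
# Default 4x4 layout from the text above.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])


def to_state(row, col):
    """Row-major state index of a grid cell (assumed numbering)."""
    return row * N_COLS + col


def to_cell(state):
    """Inverse of to_state: return (row, col) for a state index."""
    return divmod(state, N_COLS)


def tile_at(state):
    """Letter (S, F, H or G) of the tile at the given state index."""
    row, col = to_cell(state)
    return LAKE_MAP[row][col]


n_states = len(LAKE_MAP) * N_COLS
holes = [s for s in range(n_states) if tile_at(s) == "H"]
goal = to_state(3, 3)
print(holes, goal)  # [5, 7, 11, 12] 15
```

Under this numbering, the holes are the terminal states with reward 0 and state 15 is the only state that yields a reward of 1.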

In [None]:
# import OpenAI gym and create the environment
import gym
#env = gym.make('FrozenLake-v0')
from gym.envs.toy_text.frozen_lake import FrozenLakeEnv
env = FrozenLakeEnv(is_slippery=False)

In [None]:
# reset the environment and get the initial state 
s = env.reset()

In [None]:
# display the current state of the environment
env.render()

In [None]:
# what does the action space look like, i.e., what actions can we take?
env.action_space

## Task 1 "Getting started"
- First, we want to get our agent safely across the frozen lake to recover the frisbee.
- Call `env.step` multiple times to get the agent to the goal location in the lower-right corner (indicated by `G`).

- The environment's `step` function takes an `action` as input and returns four values:
    1. `state` (**object**): an environment-specific object representing the next state of the environment.
    2. `reward` (**float**): the reward obtained by the previous action. The goal is always to maximize the total reward.
    3. `done` (**boolean**): whether the episode has terminated, in which case it is time to reset the environment.
    4. `info` (**dict**): diagnostic information useful for debugging (not important in our case).
    
Example:
```python
state, reward, done, info = env.step(action)
```
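
As a rough sketch of how these four return values are typically consumed, the helper below applies a list of actions and stops as soon as `done` is set. The function name `run_actions` is hypothetical. It assumes the classic Gym API described above, in which `reset` returns the state and `step` returns four values; newer Gym releases return five values from `step`, so the unpacking would need adjusting there.

```python
def run_actions(env, actions):
    """Apply actions in order and stop early if the episode ends.

    Assumes the classic Gym API: env.reset() returns the initial state
    and env.step(action) returns (state, reward, done, info).
    """
    state = env.reset()
    total_reward = 0.0
    done = False
    for action in actions:
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            # the episode is over: goal reached or fell into a hole
            break
    return state, total_reward, done
```

For example, `run_actions(env, [2])` takes a single step; in FrozenLake the actions are numbered 0 (left), 1 (down), 2 (right) and 3 (up), so this should move the agent one cell to the right of the start.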

In [None]:
env.reset()
# perform action here ...


## Task 2 "Random agent"
- Implement a **random agent** that interacts with the environment by taking random actions.
- Repeat for 10 episodes and record whether the agent reached the goal location.
- Remember to `reset` the environment when the agent has reached a *terminal* state.

In [None]:
# here goes your code ...


## Task 3 "Choosing actions"
- Implement $\epsilon$-greedy.
- Recall:
    - with probability $1-\epsilon$ take the *greedy* action
    - with probability $\epsilon$ take a random action
- Note: a random action might also be greedy.
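
The rule above can be sketched in a few lines. This is only one possible version, not the reference solution: the name `epsilon_greedy` and the `rng` parameter are assumptions made for this example, and it uses Python's standard `random` module instead of NumPy so that it runs without extra dependencies.

```python
import random


def epsilon_greedy(q_values, epsilon=0.2, rng=random):
    """Pick an action index from a list of Q-values.

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the largest Q-value.
    """
    if rng.random() < epsilon:
        # explore: every action is equally likely, including the greedy one
        return rng.randrange(len(q_values))
    # exploit: index of the largest Q-value (the lowest index wins on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` the function is purely greedy, and with `epsilon=1` it is purely random; values in between trade off exploration against exploitation.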

In [None]:
# Hint: you can generate a uniform random number between 0 and 1 by:
import numpy as np
u = np.random.rand()
u

In [None]:
# assume the following Q-values are given for an arbitrary state by a list
Q = [0.2, 1, 3, 0]

def egreedy(Q, epsilon=0.2):
    pass

Let's run the cell below.
- Do you observe a problem?
- Can you explain what is going on?

In [None]:
# assume all Q values are equal
Q = [0, 0, 0, 0]
for i in range(10):
    print(egreedy(Q))

## Task 4 "Q-Learning"
Now, let us implement Q-learning.

Initialize Q values for all possible state-action pairs $s, a$, e.g., $\forall s,a: Q(s,a)=0$
1. Choose action $A$ to take in current state $S$ with $\epsilon$-greedy
2. Take action $A$ and observe reward $R$ and next state $S'$
3. Update Q value for $S$, $A$:
    $$ Q(S,A) \gets Q(S,A) + \alpha \Big[ R + \gamma \max_{a} Q(S', a) - Q(S,A) \Big]$$
4. Set $S \gets S'$. If $S'$ is not *terminal*, repeat from step 1
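
The steps above can be sketched end to end as follows. To keep the example runnable without Gym, it uses a hand-written stand-in for the non-slippery lake: `LAKE`, `MOVES`, `step` and `q_learning` are hypothetical names for this illustration, not Gym's API, and `step` returns only three values. Breaking ties at random in the greedy choice and using $\epsilon = 0.2$ are assumptions of this sketch.

```python
import random

# Hypothetical stand-in for the non-slippery 4x4 lake (not Gym's API).
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS, N_ACTIONS = 4, 4, 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


def step(state, action):
    """Deterministic move; returns (next_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # moving into a wall leaves the agent where it is
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    tile = LAKE[row][col]
    return row * N_COLS + col, float(tile == "G"), tile in "GH"


def q_learning(num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Q(s, a) = 0 for all state-action pairs
    Q = [[0.0] * N_ACTIONS for _ in range(N_ROWS * N_COLS)]
    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            # step 1: epsilon-greedy choice, ties broken at random
            if rng.random() < epsilon:
                a = rng.randrange(N_ACTIONS)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            # step 2: take the action, observe reward and next state
            s_next, r, done = step(s, a)
            # step 3: Q-learning update; the Q-values of terminal states
            # are never updated, so they stay 0 and add nothing to the target
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            # step 4: continue from the next state
            s = s_next
    return Q
```

Calling `Q = q_learning()` returns a table with one row per state and one column per action; the same loop should carry over to the Gym environment by replacing the stand-in `step` with `env.step` and starting each episode from `env.reset()`.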

In [None]:
# your code goes here
alpha = 0.1 # the learning rate
gamma = 0.9 # the discount factor
num_episodes = 1000 # number of episodes

# initialize Q values here

for ep in range(num_episodes):
    s, done = env.reset(), False
    while not done:
        
        done = True # REMOVE THIS LINE

        # choose action

        # take action

        # update Q values


## Task 5 "Exploit learned knowledge"
- Now use the learned knowledge to bring the agent across the lake!

In [None]:
# your code goes here
