# Reinforcement learning

## Description of the problem

An **entity** is in a **state** $s$. Each state has an associated **reward** $R(s)$. From that state the entity can take an **action** $a$ that takes it to another state $s'$. The state, an action that can be taken in that state, the state's reward, and the state the entity reaches with that action can be represented as a tuple:

$$(s, a, R(s), s')$$

The **return** is the sum of the rewards from a sequence of states and actions, weighted by a discount factor $\gamma$ that compounds with each step in the sequence.

$$return = R(s) + \gamma R(s') + \gamma^2 R(s'') + \dots$$
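
As a minimal sketch of the formula above, the return can be computed by weighting each reward by $\gamma^t$, where $t$ is the step index. The function name `discounted_return` and the reward sequence are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical rewards R(s), R(s'), R(s''), R(s''') along one sequence of states
rewards = [0, 0, 0, 100]
print(discounted_return(rewards, gamma=0.5))  # 0.5**3 * 100 = 12.5
```

A smaller $\gamma$ makes rewards that are far in the future count less.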

A **policy** is a function $\Pi$ that, for a given state $s$, gives us the action $a$ to take next.

The objective of a **reinforcement learning algorithm** is to find a policy $\Pi(s) = a$ that maximizes the return.

This problem formulation is a **Markov Decision Process (MDP)**. In an MDP, the future depends only on the current state, regardless of how we got to that state.

## State action value function

The **state action value function** (sometimes also called the *Q-function*) $Q(s, a)$ is the return if you:
* start in state $s$
* take action $a$
* behave optimally after that

The best possible return from state $s$ is $\max_{a} Q(s,a)$.

## Bellman equation

$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$

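where $a'$ ranges over the actions that can be taken in state $s'$. The first term is the immediate reward of the current state; the second term is the discounted best return obtainable from the next state $s'$.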
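
To see the Bellman equation at work, here is a small sketch that applies it repeatedly to a hypothetical four-state chain until the values stop changing. The rewards, the terminal states, the two actions (move left, move right), and the convention that $Q(s, a) = R(s)$ in a terminal state are all assumptions made for this example.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 40]   # hypothetical R(s) for states 0..3
TERMINAL = {0, 3}           # assumption: the episode ends in these states
ACTIONS = (-1, +1)          # index 0 moves left, index 1 moves right


def bellman_sweeps(n_sweeps=50):
    """Apply Q(s, a) = R(s) + gamma * max_a' Q(s', a') over all (s, a) pairs."""
    q = [[0.0, 0.0] for _ in REWARDS]
    for _ in range(n_sweeps):
        for s in range(len(REWARDS)):
            for i, move in enumerate(ACTIONS):
                if s in TERMINAL:
                    # Nothing follows a terminal state, only its own reward counts
                    q[s][i] = REWARDS[s]
                else:
                    s_next = s + move
                    q[s][i] = REWARDS[s] + GAMMA * max(q[s_next])
    return q


q = bellman_sweeps()
print(q[1])  # [50.0, 12.5]: from state 1, moving left is better
print(q[2])  # [25.0, 20.0]: from state 2, moving left is still better
```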

## Random (stochastic) environment

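In this case the next state reached by an action is random, so the quantity to maximize is the expected return of a sequence of states:

$$\text{expected return} = E[R(s) + \gamma R(s') + \gamma^2 R(s'') + \dots ]$$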

And the Bellman equation is:

$$Q(s,a) = R(s) + \gamma E[\max_{a'} Q(s', a')]$$
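
A short sketch of the expectation in this equation, assuming the transition probabilities are known. The function name, the states `"A"` and `"B"`, their Q values, and the probabilities are hypothetical.

```python
def expected_bellman_target(reward, gamma, transitions, q):
    """Compute R(s) + gamma * E[max_a' Q(s', a')].

    transitions: list of (probability, next_state) pairs for one (s, a)
    q: dict mapping each next_state to its list of Q values, one per action
    """
    expected_best = sum(p * max(q[s_next]) for p, s_next in transitions)
    return reward + gamma * expected_best


# Hypothetical: the action leads to state "A" 75% of the time, to "B" otherwise
q = {"A": [50.0, 12.5], "B": [25.0, 20.0]}
transitions = [(0.75, "A"), (0.25, "B")]
print(expected_bellman_target(0.0, 0.5, transitions, q))  # 0.5 * (37.5 + 6.25) = 21.875
```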

## Continuous state spaces

Example: state space for a car, in a 2-D world,

$$\begin{bmatrix}x \\ y \\ \theta \\ \dot{x} \\ \dot{y} \\ \dot{\theta}\end{bmatrix}$$

where $x$ and $y$ are the coordinates, $\theta$ is the orientation (angle), and $\dot{x}$, $\dot{y}$, $\dot{\theta}$ the rate of change along each coordinate and the orientation.

## Learning the state action value function

Idea: Train a neural network to compute, for a given state $s$, the value of the state action value function for each action possible in that state, so we can choose the action with the largest $Q(s, a)$. In other words, train a neural network that, given $s$ and $a$, returns $y \approx Q(s,a)$. Or, in fewer words, train the neural network to satisfy the Bellman equation.

To do so, we can create a large set of tuples

$$(s^{(1)}, a^{(1)}, R(s^{(1)}), s'^{(1)}) \\ (s^{(2)}, a^{(2)}, R(s^{(2)}), s'^{(2)}) \\ \dots$$

And then, the training examples for the neural network will be:

* For the inputs $x$, each of the tuples 
$$(s^{(1)}, a^{(1)}), (s^{(2)}, a^{(2)}), \dots$$
* For the target values $y$, the corresponding
$$Q(s^{(1)},a^{(1)}), Q(s^{(2)},a^{(2)}), \dots$$

calculated with the Bellman equation, for example

$$Q(s^{(1)}, a^{(1)}) = R(s^{(1)}) + \gamma \max_{a'} Q(s'^{(1)}, a')$$

Note that the target values $y$ depend only on the last two elements of each tuple $(s^{(i)}, a^{(i)}, R(s^{(i)}), s'^{(i)})$, that is, on $R(s^{(i)})$ and $s'^{(i)}$.
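
The construction of the training set can be sketched as follows. This is only an illustration: `build_training_set` and the `q_values` callable (which stands in for the current guess of the Q function and returns one value per action) are hypothetical names, and the sketch does not treat terminal states specially.

```python
def build_training_set(experiences, q_values, gamma):
    """Turn (s, a, R(s), s') tuples into inputs x = (s, a) and targets y."""
    xs, ys = [], []
    for s, a, reward, s_next in experiences:
        xs.append((s, a))
        # Bellman target: uses only the reward and the next state
        ys.append(reward + gamma * max(q_values(s_next)))
    return xs, ys


def current_guess(state):
    # Hypothetical stand-in for the current Q estimate (two actions per state)
    return [0.0, 1.0]


experiences = [(0, 1, 0.0, 1), (1, 0, 2.0, 2)]
xs, ys = build_training_set(experiences, current_guess, gamma=0.5)
print(xs)  # [(0, 1), (1, 0)]
print(ys)  # [0.5, 2.5]
```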

At the beginning, we don't know the $Q(s, a)$ function, so the neural network is initialized randomly as a first guess. With each iteration of the algorithm, the estimate is expected to get better.

Learning algorithm (sometimes called the **Deep Q-Network**, or DQN):

<pre>
    Initialize neural network randomly as guess of Q(s, a)
    Repeat {
        Take actions to generate tuples (s, a, R(s), s')
        Store the 10,000 most recent examples of these tuples (replay buffer)
        Train neural network:
            Create training set of 10,000 examples x, y using
                x = (s, a) and y = R(s) + &gamma; max<sub>a'</sub> Q(s', a')
            Train Q<sub>new</sub> such that Q<sub>new</sub>(s, a) &asymp; y
        Set Q = Q<sub>new</sub>
    }
</pre>
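
One simple way to keep only the most recent tuples is a bounded double-ended queue from the Python standard library. This is a sketch of the replay buffer step only, with made-up tuples; it is not taken from the course code.

```python
from collections import deque

BUFFER_SIZE = 10_000

# A deque with maxlen silently discards the oldest item when it is full
replay_buffer = deque(maxlen=BUFFER_SIZE)

# Hypothetical experience tuples (s, a, R(s), s'), numbered to show the eviction
for step in range(10_050):
    replay_buffer.append((step, 0, 0.0, step + 1))

print(len(replay_buffer))   # 10000
print(replay_buffer[0][0])  # 50: the 50 oldest tuples were dropped
```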

> Note: It is not clear from the lecture whether "take actions to generate tuples" means taking a sequence of actions until a final state is reached. Refer to the ideas in the "Search" chapter of the *CS50: AI with Python* course on edX. Maybe not, since here we are just trying to generate training samples to calculate $Q(s, a)$.

One possible architecture for the neural network (from the course example for the lunar lander, with 8 parameters for the state and 4 possible actions, one-hot encoded, which gives 12 inputs) is:

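```python
import tensorflow as tf

# Input: 8 state parameters + 4 one-hot encoded actions = 12 features.
# Output: a single unit, the estimate of Q(s, a).
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
```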

An improved architecture uses (for this case) four units in the output layer, to compute $Q(s, a)$ for all the possible actions in a state at the same time. The input, in this case, is only the 8 parameters that represent the state.

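```python
import tensorflow as tf

# Input: the 8 state parameters only.
# Output: 4 units, one estimate of Q(s, a) per possible action.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(4)
])
```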

## Algorithm refinement: $\epsilon$-greedy policy

How can we improve the "`Take actions to generate tuples (s, a, R(s), s')`" step in the algorithm?

> Note to self: again, refer to the "Search" chapter of the *CS50: AI with Python* course in edX.

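Instead of always taking actions randomly, use the following rule, which mostly exploits the current estimate of $Q(s, a)$ but still explores with probability $\epsilon$: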

<pre>
    With probability (1- &epsilon;) pick the action a that maximizes Q(s, a)
    With probability &epsilon; pick an action a randomly
</pre>
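
A minimal sketch of this rule in Python, assuming the Q values for the current state are available as a list with one entry per action. The function name and the example values are hypothetical.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q values, one per action."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: the action with the largest Q(s, a)
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3, 0.2]  # hypothetical Q(s, a) for 4 actions
print(epsilon_greedy_action(q_values, epsilon=0.0))  # 1, always greedy when epsilon is 0
```

With `epsilon=1.0` every action is random, which corresponds to the purely random exploration used before this refinement.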

## Algorithm refinement: mini-batch and soft updates

### Mini-batches

This refinement also applies to linear regression or the training of a neural network.

Idea: instead of using all the samples to calculate the cost function in each step of the gradient descent algorithm, do it using a subset (*mini-batch*) of the training examples (e.g., with a training set of 1,000,000 examples, use a mini-batch of 1,000).
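
Drawing a mini-batch can be sketched with the standard library as below. Sampling uniformly at random without replacement is an assumption of this sketch, and `sample_mini_batch` is a hypothetical helper name.

```python
import random


def sample_mini_batch(examples, batch_size, rng=random):
    """Return batch_size distinct examples chosen uniformly at random."""
    return rng.sample(examples, min(batch_size, len(examples)))


# Hypothetical training set: integers stand in for (x, y) examples
examples = list(range(10_000))
batch = sample_mini_batch(examples, batch_size=100)
print(len(batch))  # 100
```

Each gradient descent step would then use only `batch` instead of all of `examples`.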

### Soft updates

The last step in the algorithm was to replace the $Q(s, a)$ function with the newly calculated $Q_{new}(s, a)$. Doing so can create abrupt changes in the Q function, sometimes replacing a reasonably good function with a worse one.

The idea is, instead of replacing the parameters of the neural network with the new ones outright, to move them only slightly towards the new values:

$$
   W = 0.01 W_{new} + 0.99 W \\
   B = 0.01 B_{new} + 0.99 B
$$
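
The update above can be sketched as a small function that blends each parameter with its new value. The name `soft_update` and the parameter name `tau` (the 0.01 factor) are assumptions of this sketch; plain floats stand in for the weight matrices.

```python
def soft_update(current_weights, new_weights, tau=0.01):
    """Return tau * W_new + (1 - tau) * W for each pair of parameters."""
    return [
        tau * w_new + (1.0 - tau) * w
        for w, w_new in zip(current_weights, new_weights)
    ]


# Hypothetical parameters
weights = [1.0, 0.0]
new_weights = [0.0, 1.0]
print(soft_update(weights, new_weights, tau=0.5))  # [0.5, 0.5]
```

With Keras models, the two lists could come from `get_weights()` and the result could be written back with `set_weights()`, since the same arithmetic works element-wise on NumPy arrays.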

```python
import gym
import tensorflow as tf
from tensorflow import keras
```




```python
# Number of environment steps used to collect experience tuples
ITERATIONS = 10000
```

```python
env = gym.make('FrozenLake-v1', desc=None, map_name="8x8", is_slippery=True, render_mode="human")

# The FrozenLake state is a single integer; there are 4 possible moves
STATE_FEATURES = 1
POSSIBLE_ACTIONS = 4
```


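```python
# The environment must be reset before the first step
state, info = env.reset()
# Take one random action; render_mode="human" already draws every step
state, reward, terminated, truncated, info = env.step(env.action_space.sample())
```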

```python
# One output unit per action: the network estimates Q(s, a) for all actions at once
nnetwork = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(STATE_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(POSSIBLE_ACTIONS)
])

nnetwork.summary()

nnetwork.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss=tf.keras.losses.MeanSquaredError()
)
```

```
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 dense_3 (Dense)             (None, 64)                128

 dense_4 (Dense)             (None, 64)                4160

 dense_5 (Dense)             (None, 4)                 260

Total params: 4,548
Trainable params: 4,548
Non-trainable params: 0
_________________________________________________________________
```


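```python
# Replay buffer of (s, a, R(s), s') tuples
buffer = []
# reset() takes the seed as a keyword argument and returns (observation, info)
state, _ = env.reset(seed=1)
```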

```python
for _ in range(ITERATIONS):
    # Explore with uniformly random actions
    action = env.action_space.sample()
    state_prime, reward, terminated, truncated, _ = env.step(action)
    buffer.append((state, action, reward, state_prime))
    if terminated or truncated:
        # Go back to the initial state; reset() returns (observation, info)
        state, _ = env.reset()
    else:
        state = state_prime
```



```python
env.close()
```
