# Reinforcement learning

In reinforcement learning, an _agent_ makes _observations_ and takes _actions_ in an _environment_, and in return it receives _rewards_ (which may be
negative) from the environment. The goal of the agent is to act so as to maximize the rewards it receives over time. The algorithm the agent uses to
determine its actions is called its __policy__. A policy can be a neural network that takes the observations as inputs and outputs the action to take. A
policy does not have to use observations, however. Consider a robotic vacuum cleaner whose reward is the amount of dust it picks up: its policy could be to
move forward with some probability _p_ and otherwise, with probability _1 - p_, turn left or right by a random angle between _-r_ and _+r_. Because this
policy involves randomness, it is called a _stochastic policy_. It has two _policy parameters_, _p_ and _r_, and there are several ways to find good values
for them. The simplest is brute-force __policy search__: try many different sets of values and keep the one that performs best. But when the
__policy space__ is too large, this is unlikely to find a good policy in a reasonable time. An alternative is to use __genetic algorithms__: for example,
create a first generation of 100 policies, keep only the 20 best, generate variants of the survivors, and iterate over generations until a good policy
emerges. A third option is to use optimization techniques: evaluate the gradients of the rewards with regard to the policy parameters, then tweak these
parameters by following the gradients toward higher rewards. This approach is called __policy gradients (PG)__.
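
The vacuum cleaner policy and brute-force policy search can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_score` is a made-up stand-in for "average reward obtained by running the robot with parameters _p_ and _r_" (a real evaluation would need a simulator), and all function names are hypothetical.

```python
import random

def stochastic_policy(p, r, rng):
    """Move forward with probability p; otherwise turn by a random angle in [-r, r]."""
    if rng.random() < p:
        return ("forward", 0.0)
    return ("turn", rng.uniform(-r, r))

def toy_score(p, r):
    """Made-up stand-in for the average reward of policy (p, r); it peaks at p=0.7, r=0.5."""
    return -((p - 0.7) ** 2) - (r - 0.5) ** 2

def policy_search(n_candidates, rng):
    """Brute-force policy search: sample parameter sets and keep the best-scoring one."""
    candidates = [(rng.random(), rng.uniform(0.0, 3.14)) for _ in range(n_candidates)]
    return max(candidates, key=lambda params: toy_score(*params))

rng = random.Random(42)
best_p, best_r = policy_search(1000, rng)
print(f"best p = {best_p:.2f}, best r = {best_r:.2f}")
print(stochastic_policy(best_p, best_r, rng))
```

With only two parameters, random sampling works well enough here; the number of candidates needed grows quickly as the policy space gets larger, which is why the other approaches exist.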

## Using OpenAI gym

To train our agents we need a simulated environment. We are going to use the OpenAI Gym library, which provides many different simulated environments,
for example to train an agent to play an Atari game autonomously. We are going to create a CartPole environment. This is a 2D simulation in which a cart
can be accelerated left or right in order to balance a pole placed on top of it.

In [5]:
import gym

env = gym.make("CartPole-v1", render_mode="rgb_array")

After creating the environment we need to initialize it by calling reset(), which returns the first observation along with an info dictionary.
Observations depend on the type of environment. For CartPole, each observation is a 1D NumPy array of 4 floats: the cart's horizontal position, its
velocity, the angle of the pole, and its angular velocity.
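
To make the layout of an observation concrete, here is a small sketch that gives names to the four values. The `CartPoleObs` tuple and its field names are illustrative only (Gym returns a plain NumPy array), and the numbers are copied from the sample observation printed further down in this notebook.

```python
from collections import namedtuple

# Field names are illustrative; Gym itself returns a plain NumPy array.
CartPoleObs = namedtuple("CartPoleObs", ["position", "velocity", "angle", "angular_velocity"])

# Sample values copied from the observation printed later in this notebook.
obs = CartPoleObs(0.02727336, 0.18847767, 0.03625453, -0.26141977)

# The angle is in radians, with 0 meaning the pole is perfectly vertical.
print(f"pole angle: {obs.angle:.4f} rad")
print("pole leans right" if obs.angle > 0 else "pole leans left")
```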

In [6]:
obs, info = env.reset(seed=42)

# We can also look at the action space
print(env.action_space)

# Let's do an action and look at the result
action = 1  # accelerate the cart toward the right (0 would accelerate it toward the left)
obs, reward, done, truncated, info = env.step(action)
print(f"Observation: {obs}")
print(f"Reward: {reward}")
print(f"Done: {done}")
print(f"Truncated: {truncated}")
print(f"Info: {info}")

Discrete(2)
Observation: [ 0.02727336  0.18847767  0.03625453 -0.26141977]
Reward: 1.0
Done: False
Truncated: False
Info: {}


We are going to code a simple policy that accelerates left when the pole is leaning toward the left and accelerates right when the pole is leaning toward
the right.

In [7]:
import numpy as np

def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(500):
    episode_rewards = 0
    obs, info = env.reset(seed=episode)
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, truncated, info = env.step(action)
        episode_rewards += reward
        if done or truncated:
            break
    totals.append(episode_rewards)

print(np.mean(totals), np.std(totals), min(totals), max(totals))

41.698 8.389445512070509 24.0 63.0


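As we can see, the results are not very good: even over 500 episodes, this policy never kept the pole upright for more than 63 steps. Alternatively, we can
try a neural network policy. The network takes an observation as input and outputs a probability for each action, and one action is then chosen at random
according to these probabilities. In the case of the CartPole environment, there are just two possible actions (left or right), so we only need one output
neuron: it outputs the probability _p_ of action 0 (left), and the probability of action 1 (right) is _1 - p_.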
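
The sampling step can be sketched without any neural network. In this minimal illustration, `left_proba` plays the role of the single output neuron's value, and `sample_action` is a hypothetical helper name, not part of any library.

```python
import random

def sample_action(left_proba, rng):
    """Return 0 (left) with probability left_proba, otherwise 1 (right)."""
    return 0 if rng.random() < left_proba else 1

rng = random.Random(42)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(f"fraction of left actions: {left_fraction:.3f}")  # close to 0.7
```

Sampling, rather than always picking the most probable action, lets the agent keep exploring actions that currently look slightly worse.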

In [8]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

As discussed earlier, one way to train this type of policy is with __policy gradients__. One of the most popular policy gradient algorithms is the
REINFORCE algorithm, which works as follows:
- First, let the neural network policy play the game several times, and at each step, compute the gradients that would make the chosen action even more 
 likely—but don’t apply these gradients yet.
- Once you have run several episodes, compute each action’s advantage.
- If an action’s advantage is positive, it means that the action was probably good, and you want to apply the gradients computed earlier to make the action
 even more likely to be chosen in the future. However, if the action’s advantage is negative, it means the action was probably bad, and you want to apply 
 the opposite gradients to make this action slightly less likely in the future. The solution is to multiply each gradient vector by the corresponding 
 action’s advantage.
- Finally, compute the mean of all the resulting gradient vectors, and use it to perform a gradient descent step.
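
The last two steps can be sketched on toy numbers. This is a pure-Python illustration rather than the TensorFlow code used in this notebook; the function name `advantage_weighted_mean` and the gradient and advantage values are made up.

```python
def advantage_weighted_mean(grads, advantages):
    """Multiply each per-step gradient vector by its advantage, then average them."""
    n_steps = len(grads)
    dim = len(grads[0])
    return [
        sum(adv * grad[i] for grad, adv in zip(grads, advantages)) / n_steps
        for i in range(dim)
    ]

# Three steps, each with a 2-dimensional gradient vector (made-up values).
grads = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
# Positive advantage: reinforce the action. Negative advantage: push the other way.
advantages = [1.0, -1.0, 0.5]

mean_grad = advantage_weighted_mean(grads, advantages)
print(mean_grad)  # [0.5, -0.5]
```

The second gradient is applied with its sign flipped because its advantage is negative, which is exactly the "opposite gradients" case described above.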

In [None]:
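def play_one_step(env, obs, model, loss_fn):
    # Play one step and return the gradients that would make the chosen action more likely.
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])  # probability of action 0 (left)
        # Sample the action: False -> 0 (left) with probability left_proba, True -> 1 (right) otherwise
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Target probability of going left: 1 if the sampled action is left, 0 if it is right
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32) # type: ignore
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    # Index the single element explicitly; converting a (1, 1) array to int is deprecated in recent NumPy
    obs, reward, done, truncated, info = env.step(int(action[0, 0]))
    return obs, reward, done, truncated, grads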

# Now let's create another function that relies on play_one_step() to play multiple episodes,
# returning all the rewards and gradients for each episode and each step:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs, info = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, truncated, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done or truncated:
                break
        
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)

    return all_rewards, all_grads

The algorithm will use the play_multiple_episodes() function to play the game several times (e.g., 10 times), then it will go back and look at all the 
rewards, discount them, and normalize them. To do that, we need a couple more functions; the first will compute the sum of future discounted rewards at 
each step, and the second will normalize all these discounted rewards (i.e., the returns) across many episodes by subtracting the mean and dividing by the 
standard deviation.
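
As a small worked example of these two steps, take an episode with rewards 10, 0 and -50 and a discount factor of 0.8. The last return is -50, the middle one is 0 + 0.8 × (-50) = -40, and the first is 10 + 0.8 × (-40) = -22. The sketch below is a pure-Python restatement, independent of the NumPy functions in this notebook; the names `discounted_returns` and `normalize_returns` and the reward values are illustrative.

```python
from statistics import mean, pstdev

def discounted_returns(rewards, discount_factor):
    """Sum of future discounted rewards at each step, computed backward from the end."""
    returns = [float(r) for r in rewards]
    for step in range(len(returns) - 2, -1, -1):
        returns[step] += returns[step + 1] * discount_factor
    return returns

def normalize_returns(all_returns):
    """Normalize across all episodes: subtract the global mean, divide by the global std."""
    flat = [value for episode in all_returns for value in episode]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(value - mu) / sigma for value in episode] for episode in all_returns]

print(discounted_returns([10, 0, -50], discount_factor=0.8))  # [-22.0, -40.0, -50.0]

all_returns = [
    discounted_returns([10, 0, -50], 0.8),
    discounted_returns([10, 20], 0.8),
]
normalized = normalize_returns(all_returns)
print(normalized)
```

After normalization, actions with a positive value did better than average across the batch of episodes and actions with a negative value did worse, which is what the policy gradient update needs.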

In [None]:
def discount_rewards(rewards, discount_factor):
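    discounted = np.array(rewards, dtype=float)  # float copy, so integer rewards are not truncated
    for step in range(len(rewards) - 2, -1, -1):  # walk backward from the second-to-last step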
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_factor):
    all_discounted_rewards = [discount_rewards(rewards, discount_factor) for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    rewards_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / rewards_std for discounted_rewards in all_discounted_rewards]


Now let’s define the hyperparameters. We will run 150 training iterations, playing 10 episodes per iteration, and each episode will last at most 200 steps. 
We will use a discount factor of 0.95:

In [None]:
n_iterations = 150
n_episodes_per_update = 10
n_max_steps = 200
discount_factor = 0.95

# We also need an optimizer and a loss function
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = tf.keras.losses.binary_crossentropy

# Then we can run the training loop
for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(env, n_episodes_per_update, n_max_steps, model, loss_fn)
    all_final_rewards = discount_and_normalize_rewards(all_rewards, discount_factor)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        # Weight each step's gradient by its normalized return, then average over all steps and episodes
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

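The simple policy gradients algorithm we just trained solved the CartPole task, but it would not scale well to larger and more complex tasks, because it is highly sample inefficient: it needs to play many episodes before it can make significant progress.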

## Markov Decision Processes

The mathematician __Andrey Markov__ studied stochastic processes with no memory, which are now known as __Markov chains__. A Markov Decision Process (MDP) is a core concept in reinforcement learning that defines how an agent interacts with its environment to make sequential decisions. It is characterized by five key components: states, which describe the different situations the agent can encounter; actions, which are the choices available to the agent in each state; a transition function, which gives the probability of moving from one state to another given a specific action; a reward function, which provides feedback (positive or negative) based on the outcome of an action; and a discount factor, which balances the value of immediate rewards against future rewards. MDPs assume the Markov property, meaning that the next state depends only on the current state and action, not on past states, which simplifies the decision-making process. The agent's objective is to learn a policy, that is, a mapping from states to actions, that achieves the highest expected cumulative discounted reward over time, even in uncertain or stochastic environments.
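
The five components can be written down directly as data. The following is a minimal sketch of a hypothetical two-state MDP; the state names, action names, probabilities and rewards are all made up for illustration, and `step` and `rollout` are illustrative helper names.

```python
import random

# Hypothetical two-state MDP; every name and number here is illustrative.
states = ["cool", "hot"]
actions = ["slow", "fast"]
# Transition and reward functions: transitions[state][action] -> [(probability, next_state, reward)]
transitions = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)]},
    "hot":  {"slow": [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
             "fast": [(1.0, "hot", -10.0)]},
}
discount_factor = 0.9

def step(state, action, rng):
    """Sample (next_state, reward) from the current state and action only (Markov property)."""
    outcomes = transitions[state][action]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def rollout(policy, start_state, n_steps, rng):
    """Follow a policy (a dict mapping state -> action) and return the discounted sum of rewards."""
    state, total = start_state, 0.0
    for t in range(n_steps):
        state, reward = step(state, policy[state], rng)
        total += (discount_factor ** t) * reward
    return total

cautious_policy = {"cool": "slow", "hot": "slow"}
print(rollout(cautious_policy, "cool", n_steps=50, rng=random.Random(0)))
```

Starting in "cool" and always going "slow", the agent stays in "cool" and collects a reward of 1 at every step, so the printed value is the geometric sum (1 - 0.9^50) / 0.1, a little under 10. Comparing such discounted returns across policies is what "optimizing the policy" means in practice.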