# Reinforcement Learning (RL)

RL involves learning what to do in order to maximize a numerical reward; essentially, how to map situations to actions so as to maximize a reward signal. 
The learner is not told which actions to take; instead, it must discover which actions yield the most reward by trying them out. This leads to one of the main challenges in RL: the trade-off between *exploration* and *exploitation*. The agent has to exploit what it already knows in order to obtain a reward, but it also has to explore in order to make better action selections in the future. Note that actions may affect not only immediate rewards, but also future rewards. 
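
The exploration/exploitation trade-off can be illustrated with a multi-armed bandit and an epsilon-greedy agent. This is only a minimal sketch: the arm means, the unit-variance Gaussian rewards, and the function name `run_epsilon_greedy` are assumptions made for illustration, not part of any library.

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Play a bandit with an epsilon-greedy agent (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)          # running average reward per arm
    counts = np.zeros(n_arms, dtype=int)  # how often each arm was pulled
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: pick a random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: pick the best arm so far
        reward = rng.normal(true_means[arm], 1.0)  # assumed noisy reward
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward

estimates, counts, total = run_epsilon_greedy([0.1, 0.5, 0.9], epsilon=0.1, steps=5000)
print(counts, estimates.round(2))
```

With `epsilon=0` the agent never explores and can lock onto a poor arm; with `epsilon=1` it never uses what it has learned.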

### RL Applications

1. Games (e.g., [Atari](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), [Go](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf), [Chess](https://arxiv.org/abs/1712.01815))
2. [Robotic Control](https://arxiv.org/pdf/1504.00702.pdf)
3. [Traffic Light Control](http://web.eecs.utk.edu/~ielhanan/Papers/IET_ITS_2010.pdf)

For more applications please refer to Dr. Yuxi Li's RL Applications [paper](https://arxiv.org/pdf/1908.06973.pdf).

### Elements of RL

<ul>
<li> <b> Policy </b>: defines the learning agent's way of behaving at a given time. In some cases the policy can be as simple as a lookup table, whereas in others it may involve complex search processes. In general, policies may be stochastic.</li>
    
<li> <b> State </b>: concrete and immediate situation in which an agent finds itself.  </li>
    
<li> <b> Action </b>: set of all possible moves the agent can make.  </li>

<li> <b> Reward Signal </b>: defines the goal in an RL problem. The agent's sole objective is to maximize the total reward it receives over the long run. The reward signal defines what the good and bad events are for the agent -- analogous to pleasure and pain.</li>

<li> <b> Value Function </b>: roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards are immediate, values indicate the long-term desirability of states, taking into account the states that are likely to follow and the rewards available in those states. Values are closer to how pleased or displeased we are that our environment is in a particular state.</li>

<li> <b> Model </b>: something that mimics the behavior of the environment. Models are used for planning.</li>
    
</ul>


### Markov Decision Process (MDP)

Suppose we have a recycle bot, where the agent can:  <br>
(1) actively search for a can, <br>
(2) remain stationary waiting for a can, or <br>
(3) recharge its battery. 
<br>
The agent makes its decisions solely as a function of the energy level of the battery. In this case the **states** are *high* or *low*. **Actions** are *wait*, *search*, and *recharge*. <br>


![MDP](MDP.jpg "MDP")
![Agent-environment Interaction](RL_Model.jpg "Agent-environment Interaction")

<br>

This system is a finite MDP. The value function $v_{\pi}(s)$ is defined as the expected return when the agent starts in state $s$ and follows a policy $\pi$ thereafter. $v_{\pi}(s)$ has a recursive structure, and it can be shown that it satisfies a Bellman equation, which relates the value of a state to the values of its successor states.

$v_{\pi}(s) = \sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')]$ <br>
<br>
$s$ = current state <br>
$a$ = actions <br>
$s'$ = future (successor) states <br>
$r$ = rewards <br>
$\gamma$ = discount factor (to dampen future rewards)

The *Bellman equation* averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
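
The Bellman equation can be applied repeatedly as an update rule until the values stop changing (iterative policy evaluation). The sketch below does this for a two-state robot like the one described above, under a uniform random policy. The transition probabilities (`ALPHA`, `BETA`), the rewards (`R_SEARCH`, `R_WAIT`, the -3 penalty for depleting the battery), the discount `GAMMA`, and the restriction that *recharge* is only available in the *low* state are all illustrative assumptions, not values given in this notebook.

```python
# Assumed dynamics: P[(state, action)] = list of (probability, next_state, reward)
ALPHA, BETA = 0.8, 0.6        # assumed chance the battery level stays the same while searching
R_SEARCH, R_WAIT = 2.0, 1.0   # assumed expected rewards
GAMMA = 0.9                   # assumed discount factor

P = {
    ("high", "search"): [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"): [(1.0, "high", R_WAIT)],
    ("low", "search"): [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("low", "wait"): [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Uniform random policy pi(a|s) over the actions available in each state.
POLICY = {
    "high": {"search": 0.5, "wait": 0.5},
    "low": {"search": 1 / 3, "wait": 1 / 3, "recharge": 1 / 3},
}

def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for state s."""
    return sum(
        pi_a * sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])
        for a, pi_a in POLICY[s].items()
    )

def evaluate_policy(tol=1e-10):
    """Apply the Bellman backup to every state until the values converge."""
    v = {s: 0.0 for s in POLICY}
    while True:
        new_v = {s: bellman_backup(v, s) for s in v}
        if max(abs(new_v[s] - v[s]) for s in v) < tol:
            return new_v
        v = new_v

print(evaluate_policy())
```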

## Deep Learning (DL) in RL

When the state space or action space is too large to be completely known, a neural network can be used to approximate a value function or a policy function. That is, neural networks can learn to map states to values, or state/action pairs to Q-values (a measure of the overall expected reward). These networks are trained on sampled states or actions to predict how valuable those are relative to the target. Often the state is something visual, like a game screen, so Convolutional Neural Networks (CNNs) are commonly used. However, instead of distinguishing between a dog and a cat as in supervised learning, in RL the CNN is used to rank the possible actions given the image.

![CNN_SL](CNN_cat.jpg "CNN")
![CNN_RL](CNN_game.jpg "CNN")

The CNN's weights are initialized randomly; using feedback from the environment, the CNN can use the difference between its expected reward and the actual reward to adjust its weights. The main difference from supervised learning is that in supervised learning the CNN starts with the true labels, whereas in RL the CNN relies on the environment to send a scalar number (the reward) in response to each new action. The complication is that backpropagation requires the loss to be differentiable with respect to the weights in order to compute the gradients, and environments are often not differentiable (or their dynamics are unknown). Therefore, RL uses two main families of algorithms:

<ol>
    <li> <b> Policy Gradients </b>: estimate the gradient of the expected reward with respect to the policy weights. </li>
    <li> <b> Q-learning </b>: learn the expected reward of each state/action pair using standard backpropagation, then take the action with the highest expected reward. For each action available in a state, estimate the expected reward, then select the action that maximizes it. </li>
</ol>
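
To make the policy-gradient idea concrete, here is a minimal REINFORCE-style sketch on a two-action problem with a softmax policy. The reward values, the learning rate, and the helper names are assumptions chosen for illustration; no baseline or neural network is used.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_bandit(rewards, lr=0.1, steps=2000, seed=0):
    """Learn a softmax policy over actions with fixed (assumed) rewards."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards))        # policy weights: one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choice(len(rewards), p=probs)  # sample from the policy
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        # Move the weights along reward * grad log pi (gradient ascent).
        theta += lr * reward * grad_log_pi
    return softmax(theta)

print(reinforce_bandit(rewards=[0.0, 1.0]))
```

The environment is never differentiated; only the log-probability of the sampled action is, which is why this family of methods copes with unknown dynamics.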

![Q-learning](Qlearning.jpg "Q")

Refer to this [Nature paper](https://www.nature.com/articles/nature14236) for a full explanation on Q-Learning in Deep RL.
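
Before moving to deep networks, the Q-learning update can be sketched in tabular form. The environment below (a five-state corridor where only reaching the last state is rewarded) and all hyperparameters are assumptions chosen to keep the example small.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2    # assumed toy corridor; actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Deterministic corridor dynamics; reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(N_ACTIONS))   # explore
            else:
                # Exploit, breaking ties between equal Q-values at random.
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(best))
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next Q-value.
            target = reward + (0.0 if done else gamma * q[next_state].max())
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q

q = q_learning()
print(np.argmax(q[:GOAL], axis=1))  # greedy action in each non-terminal state
```

A deep Q-network replaces the table `q` with a neural network and fits it to the same target by backpropagation.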

### OpenAI Gym Example

In [8]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.cem import CEMAgent
from rl.memory import EpisodeParameterMemory

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)

nb_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

# Simple Neural Network
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(nb_actions))
model.add(Activation('softmax'))
print(model.summary())


# Compile the Agent
memory = EpisodeParameterMemory(limit=1000, window_length=1)
cem = CEMAgent(model=model, nb_actions=nb_actions, memory=memory,
               batch_size=50, nb_steps_warmup=2000, train_interval=50, elite_frac=0.05)
cem.compile()

# Train the agent (rendering is disabled with visualize=False)
cem.fit(env, nb_steps=100000, visualize=False, verbose=2)
# Save best weights
cem.save_weights('cem_{}_params.h5f'.format(ENV_NAME), overwrite=True)
# Evaluate our algorithm for 5 episodes.
cem.test(env, nb_episodes=5, visualize=True)

Using TensorFlow backend.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 10        
_________________________________________________________________
activation_1 (Activation)    (None, 2)                 0         
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________
None
Training for 100000 steps ...
    15/100000: episode: 1, duration: 0.025s, episode steps:  15, steps per second: 611, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
    38/100000: episode: 2, duration: 0.007s, episode steps:  23, steps per second: 3307, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.0

<tensorflow.python.keras.callbacks.History at 0x1d0d0486c48>

For a full explanation of this example go to: [wiki](https://github.com/openai/gym/wiki/CartPole-v0)
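
The agent in the example above is based on the cross-entropy method (CEM). The core loop of that method can be sketched in plain NumPy as follows. This is not keras-rl's implementation: the function name, the Gaussian sampling distribution, the toy objective, and all default values are assumptions for illustration.

```python
import numpy as np

def cross_entropy_method(score_fn, dim, iterations=50, batch_size=50,
                         elite_frac=0.2, init_std=2.0, seed=0):
    """Maximize score_fn by repeatedly refitting a Gaussian to the best samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.full(dim, init_std)
    n_elite = max(1, int(batch_size * elite_frac))
    for _ in range(iterations):
        # Sample candidate parameter vectors around the current mean.
        samples = rng.normal(mean, std, size=(batch_size, dim))
        scores = np.array([score_fn(s) for s in samples])
        # Keep the highest-scoring fraction (the "elite" set).
        elite = samples[np.argsort(scores)[-n_elite:]]
        # Refit the sampling distribution to the elite set.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean

# Toy objective (assumed): get as close as possible to a target vector.
target = np.array([1.0, -2.0, 0.5])
best = cross_entropy_method(lambda w: -np.sum((w - target) ** 2), dim=3)
print(best)
```

In the CartPole setting, each sampled vector would be a set of network weights and the score would be the episode reward obtained with those weights.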

## Atari Paper (Playing Atari with Deep Reinforcement Learning)

In essence, DeepMind trained a CNN with a variant of Q-learning to create a Deep RL model which played seven Atari 2600 games (Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, and Space Invaders) with no architecture adjustments between games, and outperformed human experts in three of the seven (Breakout, Enduro, and Pong).

The agent, a CNN, takes the raw image (pixels) from an Atari game as input and assigns a value to each of the possible joystick actions (up to 18). Each output is the agent's estimate of the future reward it will obtain if it performs that action in the given state. The RL model can then pick the action with the highest estimated value and send it to the environment (the Atari emulator).

A big challenge here is that in Deep Learning the input samples are often assumed to be independent of each other. This is not the case in RL, where consecutive samples are highly correlated and rewards can be delayed. A simple example is chess, where it might take many moves to checkmate the opponent's king: the early moves do not return an immediate reward as the final move does, even though one of them might be crucial to winning.

![Atari](Atari.jpg "A")

To overcome these issues, DeepMind stored the agent's experiences and trained on batches sampled at random from them. This makes the training samples more random and approximately uncorrelated. 
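
A record of experiences of this kind is usually called a replay buffer. The sketch below is a minimal version; the class name, the `Transition` fields, and the capacity are assumptions for illustration rather than the paper's code.

```python
import random
from collections import deque, namedtuple

# One stored experience (field names are illustrative).
Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size store of past transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks up runs of consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000, seed=0)
for t in range(20):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(4)
print([tr.state for tr in batch])
```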

### Background 

While playing the game, many consecutive frames are almost identical (the player might remain in the same position for seconds), so the same action is carried over several frames. A single state is therefore built from a sequence of several frames and the actions between them. This gives rise to a large but finite MDP. 

TD-Gammon was an earlier architecture, used to play backgammon in 19; it followed an on-policy approach, training a multi-layer perceptron directly on samples of experience drawn from the algorithm's interaction with the environment. Such an approach can get stuck in poor local minima when the samples are biased. Instead, this paper explores Experience Replay, where the agent's experience at each time step is stored and pooled over many episodes into a replay *memory*, and Q-learning updates are applied to samples drawn at random from that memory. After performing the Experience Replay update, the agent selects and executes an action according to an ε-greedy policy. Experience Replay reduces the bias in the samples by averaging over many previous states, and it makes the algorithm off-policy, because the current parameters differ from those used to generate the samples.
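
The Q-learning update applied to a randomly drawn batch needs a regression target for each sampled transition. A minimal sketch of that target computation follows, assuming the next-state Q-values have already been predicted by the network; the function name and the example numbers are illustrative.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Q-learning targets: r for terminal steps, else r + gamma * max_a' Q(s', a')."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    max_next_q = np.max(np.asarray(next_q_values, dtype=float), axis=1)
    # (~dones) zeroes the bootstrap term for transitions that ended the episode.
    return rewards + gamma * max_next_q * (~dones)

targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [1.0, 3.0]],
    dones=[False, True],
    gamma=0.5,
)
print(targets)  # [2. 0.]
```

The network is then trained so that its prediction for the action actually taken in each sampled state moves toward the corresponding target.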

### Architecture Deep-Q-Networks (DQN)

![Atari](architecture.jpg "Ar")
![Atari](layers.jpg "L")

The output corresponds to the predicted Q-values of the individual actions for the input state. The main advantage of this architecture is the ability to compute Q-values for all possible actions in a given state with a single forward pass through the network.

Network:
1. Input: 84x84x4 image
2. First hidden layer: convolution with 16 8x8 filters and a stride of 4, followed by ReLU
3. Second hidden layer: convolution with 32 4x4 filters and a stride of 2, followed by ReLU
4. Final hidden layer: fully-connected, 256 rectifier units
5. Output: fully-connected linear layer with a single output for each valid action; the number of valid actions varies between 4 and 18 depending on the game.
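
The layer sizes listed above determine how many values reach the fully-connected layer. The short calculation below works this out, assuming the convolutions use no padding (an assumption; the list does not state the padding).

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1

after_conv1 = conv_output_size(84, kernel=8, stride=4)           # 20
after_conv2 = conv_output_size(after_conv1, kernel=4, stride=2)  # 9
flattened = after_conv2 * after_conv2 * 32                       # 2592 inputs to the 256-unit layer

print(after_conv1, after_conv2, flattened)
```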

In [None]:
# From Tutorial:https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26 (uses theano)
# Ported to the Keras 2 API (Conv2D with strides=, Multiply layer, inputs=/outputs=).
import keras

def atari_model(n_actions):
    # Channels-first layout (frames, height, width), as in the Theano-based tutorial.
    ATARI_SHAPE = (4, 105, 80)
    frames_input = keras.layers.Input(ATARI_SHAPE, name='frames')
    # One-hot mask that selects the Q-value of the action actually taken.
    actions_input = keras.layers.Input((n_actions,), name='mask')

    # Scale pixel values from [0, 255] to [0, 1].
    normalized = keras.layers.Lambda(lambda x: x / 255.0)(frames_input)
    
    conv_1 = keras.layers.Conv2D(16, (8, 8), strides=(4, 4), activation='relu',
                                 data_format='channels_first')(normalized)
    conv_2 = keras.layers.Conv2D(32, (4, 4), strides=(2, 2), activation='relu',
                                 data_format='channels_first')(conv_1)
    conv_flattened = keras.layers.Flatten()(conv_2)
    hidden = keras.layers.Dense(256, activation='relu')(conv_flattened)
    output = keras.layers.Dense(n_actions)(hidden)
    # Zero out the Q-values of every action except the masked one.
    filtered_output = keras.layers.Multiply()([output, actions_input])

    model = keras.models.Model(inputs=[frames_input, actions_input], outputs=filtered_output)
    optimizer = keras.optimizers.RMSprop(lr=0.00025, rho=0.95, epsilon=0.01)
    model.compile(optimizer, loss='mse')
    return model

Note: all positive rewards are clipped to 1 and all negative rewards to -1, while zero rewards are left unchanged.
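
The reward rule in the note amounts to taking the sign of the raw score change. A minimal sketch, with `clip_reward` as an illustrative helper name:

```python
import numpy as np

def clip_reward(reward):
    """Map any positive reward to 1, any negative reward to -1, and keep 0 as 0."""
    return float(np.sign(reward))

print([clip_reward(r) for r in (7, -3, 0)])  # [1.0, -1.0, 0.0]
```

For integer-valued game scores, `np.clip(reward, -1., 1.)` gives the same result; the two differ only for fractional rewards.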

During training there is no such thing as a *validation accuracy* to compare against. In Deep RL the total reward per episode is often used as the evaluation metric; however, the averaged total score tends to be noisy. So DeepMind also tracked the Q-value itself (the average of the maximum predicted Q-value over a fixed set of states) as an evaluation metric (no idea how they realized this would work).
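
One way to read "using the Q-value as a metric" is to fix a set of states before training and, at each evaluation point, average the highest Q-value the network predicts for each of them. A minimal sketch, assuming the predictions are already available as an array of shape `(n_states, n_actions)`; the function name and the numbers are illustrative.

```python
import numpy as np

def average_max_q(q_values):
    """Mean over held-out states of the best predicted Q-value in each state."""
    q_values = np.asarray(q_values, dtype=float)
    return float(np.mean(np.max(q_values, axis=1)))

# Two held-out states, two actions each (made-up predictions).
print(average_max_q([[1.0, 2.0], [3.0, 0.0]]))  # 2.5
```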

![R](reward.jpg "R")


### Results

![R](results.jpg "R")

Using ONLY raw pixels as an input this paper demonstrated an ability to master Atari Games using DQN.

## Atari Game Example
Note that this code does not work on Windows because OpenAI Gym does not support BreakoutDeterministic-v4 on Windows.

### Libraries

In [None]:
from PIL import Image
import numpy as np
import gym

# Keras Libraries
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from keras.optimizers import Adam
import keras.backend as K

# Keras-rl Libraries
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint, Callback

# Visualization Libraries
from IPython import display
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
%matplotlib inline

# Window Size
INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4

# Random Seed (NumPy only: the Atari `env` has not been created at this point, so it cannot be seeded here)
np.random.seed(123)

### Define Atari Processor

In [None]:
class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L')
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8')

    def process_state_batch(self, batch):
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)

This is the part that DOES NOT work on Windows Systems

In [None]:
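# Name used for the environment and for the weight/log file names in the cells that follow
ENV_NAME = "BreakoutDeterministic-v4"
env = gym.make(ENV_NAME)
env.seed(123)
nb_actions = env.action_space.n
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE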

### Model (Deep Learning)

In [None]:
model = Sequential()
model.add(Permute((2, 3, 1), input_shape=input_shape))
model.add(Convolution2D(32, (8, 8), strides=(4, 4)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)

dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1.)

dqn.compile(Adam(lr=.00025), metrics=['mae'])

In [None]:
weights_filename = 'dqn_{}_weights.h5f'.format(ENV_NAME)
checkpoint_weights_filename = 'dqn_' + ENV_NAME + '_weights_{step}.h5f'
log_filename = 'dqn_{}_log.json'.format(ENV_NAME)
callbacks = [ModelIntervalCheckpoint(checkpoint_weights_filename, interval=250000)]
callbacks += [FileLogger(log_filename, interval=100)]
dqn.fit(env, callbacks=callbacks, nb_steps=1750000, log_interval=10000)

dqn.save_weights(weights_filename, overwrite=True)
dqn.test(env, nb_episodes=1, visualize=False)

### To visualize on Jupyter Notebook

In [None]:
class Render(Callback):
    def on_step_end(self, step, logs={}):
        plt.clf()
        plt.imshow(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)

#weights_filename = 'dqn_{}_weights.h5f'.format(ENV_NAME)
weights_filename = 'dqn_{}_weights_1750000.h5f'.format(ENV_NAME) # @check point
dqn.load_weights(weights_filename)

callbacks = Render()
plt.figure(figsize=(6,8))
dqn.test(env, nb_episodes=1, visualize=False, callbacks=[callbacks])

env.close()
print('END')

### Render

In [None]:
ims = []

class Render(Callback):
    def on_step_end(self, step, logs={}):
        im = plt.imshow(env.render(mode='rgb_array'))
        ims.append([im])

weights_filename = 'dqn_{}_weights_1750000.h5f'.format(ENV_NAME)
dqn.load_weights(weights_filename)

callbacks = Render()
fig = plt.figure(figsize=(4,5))
plt.axis('off')
dqn.test(env, nb_episodes=1, visualize=False, callbacks=[callbacks])

ani = animation.ArtistAnimation(fig=fig, artists=ims, interval=10)
# ani.save("anim.gif", writer = "imagemagick")  # imagemagick for Ubuntu
plt.close()

HTML(ani.to_jshtml())         # JavascriptHTML output
#HTML(ani.to_html5_video())   # HTML5 Video output

## References/Resources

1. https://medium.com/free-code-camp/explained-simply-how-deepmind-taught-ai-to-play-video-games-9eb5f38c89ee
2. https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26
3. https://github.com/mirrornerror/Keras-RL/blob/master/dqn-atari-jupyter.ipynb (Keras RL)
4. https://ai.stackexchange.com/questions/4660/how-to-combine-backpropagation-in-neural-nets-and-reinforcement-learning
5. https://stats.stackexchange.com/questions/340651/how-does-backpropagation-work-in-the-case-of-reinforcement-learning-for-games
6. https://adventuresinmachinelearning.com/reinforcement-learning-tutorial-python-keras/
7. http://incompleteideas.net/book/RLbook2018.pdf (introduction to RL)