<img src="images/criteoAiLab.png" height=45>

# Practical Session : Deep RL

## Agenda

  1. Deep-RL Theory Reminder (15')
  1. DQN (30')
  1. $\epsilon$-DQN (20')
  1. DQN with Replay (20')
  1. Potential problems & solutions (10')
  1. Wrap-up (5')
  
(!) Regular polls/quiz
  
  <img src="images/logo.png" width="250px" align='center'>

# Deep-RL Theory
   
<img src="images/rl.png" width="480" alt="test"/>

**Goal**: find the policy $\pi$ that maximises the expected cumulative discounted reward
  - $R = \sum \limits_{t=0}^{\infty} \gamma^t r_t$
  - $0<\gamma<1$
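
As a small illustration of the formula above, here is a sketch that computes $R$ for a finite list of rewards. The function name `discounted_return` and the default value of `gamma` are assumptions made for this example; they are not part of the session's code.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Walk backwards: each step multiplies everything that comes later by gamma once more.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1 each and gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With $\gamma$ close to 1 the agent values long-term reward almost as much as immediate reward; with a small $\gamma$ it becomes short-sighted.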
  

## Reinforcement Learning for a Markov Decision Process 

Under a fixed policy, $V$ is the value of the state:

<img src="images/vpi.png" width="480"/>

Under a fixed policy, $Q$ is the value of the tuple (state, action):

<img src="images/qpi.png" width="640"/>

Remark: after time $t$, actions are chosen by $\pi$, so in expectation $Q^\pi(s_{t+1}, a_{t+1})$ can be replaced by $V^\pi(s_{t+1})$

### Q-learning

 - The idea is to learn $Q$, which tells us which action is best in a given state

## Bellman Optimality Principle

The optimal policy $\pi^*$ must satisfy

<img src="images/bellman.png" width=320>

**Strategy**: as long as this equality does not hold, we keep improving $\pi$

For a fixed policy $\pi$, if the two quantities are not equal:
  - either $Q$ is not correctly estimated (policy evaluation issue)
  - or $\pi$ is not selecting optimal actions, which can be fixed by being more greedy w.r.t. $Q$

<img src="images/evalimprov.png" width=480>

<center><sup>(R. S. Sutton, 1998, Reinforcement Learning An introduction)</sup></center>
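
The "improvement" half of this loop can be sketched in a few lines for a tabular $Q$. The dictionary layout `Q[(state, action)]` and the name `greedy_policy` are assumptions chosen for the illustration.

```python
def greedy_policy(Q, states, n_actions):
    """Return, for each state, the action with the highest estimated Q-value."""
    return {s: max(range(n_actions), key=lambda a: Q[(s, a)]) for s in states}


# Toy Q-table with 2 states and 2 actions (values are made up for the example)
Q = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 1.5, (1, 1): -0.3}
policy = greedy_policy(Q, states=[0, 1], n_actions=2)
print(policy)  # {0: 1, 1: 0}
```

If $Q$ is badly estimated, being greedy with respect to it does not help, which is why evaluation and improvement have to alternate.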

# Deep Q-Learning (DQN)

<img src="images/dqn-th.png" width=640>

**Q-learning**:

While $Q^\pi(s_t, a_t) \ne r_t + \gamma \max_a Q^\pi(s_{t+1}, a)$:
  1. take action
  1. receive reward / new state 
  1. improve target evaluation
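
Before moving to neural networks, the loop above can be sketched in its classical tabular form. This is only an illustration: the step size `alpha`, the dictionary-based table and the function name are assumptions, and the DQN agent used later replaces the table update by a gradient step on a network.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """Move Q(state, action) a step of size alpha towards the Bellman target."""
    if done:
        # No future reward after a terminal state
        target = reward
    else:
        target = reward + gamma * max(Q[(next_state, a)] for a in range(n_actions))
    td_error = target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return td_error


Q = defaultdict(float)  # every Q-value starts at 0
td = q_learning_update(Q, state=0, action=1, reward=1.0,
                       next_state=1, done=False, n_actions=2)
print(td, Q[(0, 1)])  # 1.0 0.1
```

The returned `td_error` is the gap between the two sides of the Bellman equation for this transition; it goes to zero as the estimate becomes consistent.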

# Deep Q-Learning (DQN)
  - $Q$ is modeled as a neural network
  - inputs = states
  - outputs = one Q-value per action

<center><img src="images/q-learning.png" width=480></center>
<center><sup><sub>Image from https://leonardoaraujosantos.gitbooks.io</sub></sup></center>
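
To make the input/output convention concrete, here is a minimal pure-Python sketch of a forward pass through a one-hidden-layer network. It is not the Keras model used in the exercises: the helper names (`dense`, `init_layer`), the hidden size and the random initialisation are assumptions, and only the shapes matter (4 state variables in, 2 Q-values out, as in CartPole).

```python
import random


def dense(x, weights, biases, relu=False):
    """Fully connected layer: one weight row and one bias per output unit."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (arbitrary choice for the sketch)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
state_size, hidden_size, action_size = 4, 8, 2
W1, b1 = init_layer(state_size, hidden_size, rng)
W2, b2 = init_layer(hidden_size, action_size, rng)

state = [0.01, -0.02, 0.03, 0.04]  # (x, x_dot, theta, theta_dot)
# Hidden layer with ReLU, then a linear output layer: Q-values are unbounded
q_values = dense(dense(state, W1, b1, relu=True), W2, b2)
best_action = max(range(action_size), key=lambda a: q_values[a])
print(len(q_values), best_action)
```

A single forward pass gives the value of every action at once, so picking the greedy action is just an argmax over the output vector.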

# Deep-RL Benchmarks

## OpenAI 

  - [Gym](https://gym.openai.com/envs/CartPole-v0/) : set of standard problems / environments

### Simple control problems

<img src="images/control.png" width=480>

### The famous "learning to play Atari" benchmark

<img src="images/atari.png" width=480>
<center><sup><sub>Images from https://github.com/dgriff777/rl_a3c_pytorch</sub></sup></center>



In [None]:
class RLEnvironment:

    def run(self, agent, episodes=100, ...):
        """
        Run the agent.

        Pseudo-code:
        ```
            for i in 1..episodes {

                start new episode

                while episode not finished {

                    ask agent to take action based on current state
                    action resolved by environment, returning reward and new state

                    <opportunity to feedback agent with (state, reward, new state)>
                    <opportunity to update agent parameters>

                    if new state means agent failed {
                        terminate episode
                    }

                }

                <opportunity to update agent parameters again>

                if last episodes show enough reward {
                    declare task solved
                }

            }

        ```
        """

In [None]:
class RandomAgent:
    """The world's simplest agent!"""
    def __init__(self, action_space):
        self.action_space = action_space

    def get_action(self, state):
        return self.action_space.sample()

In [None]:

env.run(RandomAgent(env.action_space), episodes=20, display_policy=True)

### CartPole
  - $(x, \dot x)$: position/speed of cart
  - $(\theta, \dot \theta)$: angle/angular velocity of the pole
  - actions: move left/right
  - environment simulates acceleration (cart/pole mass)

<img src="images/cartpole.gif" width=640>

## 3. DQN in practice
  - Skeleton provided, implemented with [Keras](https://keras.io/)

In [None]:
class DQNAgent(RLDebugger):
    def __init__(self, observation_space, action_space):
        self.learning_rate = ??? 
        
    def build_model(self):
        model = Sequential()
   
        model.add(Dense(???, input_dim=self.state_size, activation=???))
        model.add(Dense(self.action_size, activation=???))

        model.compile(loss=???, optimizer=Adam(lr=self.learning_rate))

        model.summary()
        return model

In [None]:

    # get action from model using greedy policy. 
    def get_action(self, state):
        q_value = self.model.predict(state)
        best_action = np.argmax(q_value[0])
        return best_action

    # train the model on the selected action and transition,
    # using the target network to estimate the value of the next state
    def train_model(self, action, state, next_state, reward, done):
        target = self.model.predict(state)

        target_val = self.target_model.predict(next_state)

        if done:
            target[0][action] = reward
        else:
            target[0][action] = reward + self.gamma * (np.amax(target_val))

        loss = self.model.fit(state, target, verbose=0).history['loss'][0]
        self.record(action, state, target, target_val, loss)        

In [None]:
agent = DQNAgent(env.observation_space, env.action_space)
env.run(agent, episodes=100)

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 30)                150       
=================================================================
Total params: 150
Trainable params: 150
Non-trainable params: 0
_________________________________________________________________
Episode 10, Total reward 9.0
Episode 20, Total reward 9.0
Episode 30, Total reward 10.0
Episode 40, Total reward 10.0
Episode 50, Total reward 10.0
Episode 60, Total reward 10.0
Episode 70, Total reward 10.0
Episode 80, Total reward 10.0
Episode 90, Total reward 10.0
Episode 100, Total reward 10.0
Average Total Reward of last 100 episodes: 9.94
```


### Evaluation
- **Objective:** average reward of 200 over 100 episodes
- *First Target:* reach a reward > 50 (at least once)
- We only focus on tuning model parameters for now
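
The objective above can be tracked with a sliding window over the most recent episodes. The sketch below is an assumption about how such a check could look (the class `SolvedTracker` is hypothetical and is not the code behind `env.run`).

```python
from collections import deque


class SolvedTracker:
    """Keep the last `window` episode rewards and compare their mean to a threshold."""

    def __init__(self, window=100, threshold=200.0):
        self.rewards = deque(maxlen=window)  # old episodes drop out automatically
        self.threshold = threshold

    def add(self, total_reward):
        self.rewards.append(total_reward)

    def average(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def solved(self):
        # Only declare success once the window is full
        window_full = len(self.rewards) == self.rewards.maxlen
        return window_full and self.average() >= self.threshold


tracker = SolvedTracker(window=3, threshold=150.0)  # small window for the demo
for episode_reward in [10.0, 120.0, 200.0, 200.0]:
    tracker.add(episode_reward)
print(tracker.average(), tracker.solved())
```

Averaging over a window rather than looking at a single episode matters here because individual episode rewards are noisy.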

___

**Your turn now !**

```
$ jupyter notebook exercises.ipynb
```

Be warned: your results will vary from one run to the next (the model is randomly initialised)

# DQN First Results

### POLL: who reached a reward > 50?


### POLL: how many neurons? more than 20, 50, 100?

### POLL: how many layers? 1,2, more?

### POLL: what loss? mse, mean_absolute_error, another?

### POLL: what learning rate? more than $10^{-4}$, $10^{-3}$, $10^{-2}$?

## Case Study

`env.run(agent, episodes=100, seed=0)`

### Symptom

```
agent.plot_loss()
```

| | |
|--|--|
| ![](images/loss_dqn_1.png) | ```Total Reward: 10.0``` |

### POLL: Is it a problem of ...
  - Model capacity (#neurons) ?
  - Loss function ?
  - Activation ?
  - Exploration ?

```
agent.plot_action()
```

![](images/action_dqn_1.png)

**Take away:** Q-learning can converge only if it explores enough actions (hence states)

# DQN with Exploration

Exploration of the states is crucial for performance

  - add a uniform exploration mechanism
  - decrease exploration over time
  
This is the first agent that will actually solve the task. It typically needs a few hundred episodes to collect enough data.
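
One common way to implement uniform exploration is the $\epsilon$-greedy rule: with probability $\epsilon$ pick a random action, otherwise pick the greedy one. The sketch below works on a plain list of Q-values; the function name and the `rng` parameter are assumptions for the example, and in the agent the Q-values would come from `self.model.predict(state)`.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9]
print(epsilon_greedy(q_values, epsilon=0.0))  # 1: always greedy
print(epsilon_greedy(q_values, epsilon=1.0))  # 0 or 1, chosen uniformly at random
```

Setting $\epsilon = 0$ recovers the purely greedy agent from the previous section.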

In [None]:
class DQNAgentWithExploration(DQNAgent):
    def __init__(self, observation_space, action_space):
        super(DQNAgentWithExploration, self).__init__(observation_space, action_space)
        # exploration schedule parameters 
        self.t = 0
        self.epsilon = ???
        # TODO store your additional parameters here 

    # decay epsilon
    def update_epsilon(self):
        self.t += 1
        # TODO write the code for your decay  
        self.epsilon = ???

In [None]:
agent = DQNAgentWithExploration(env.observation_space, env.action_space)
env.run(agent, episodes=500, print_delay=50)

```
Layer (type)                 Output Shape              Param #   
=================================================================
dense_7 (Dense)              (None, 30)                150       
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 62        
=================================================================
Total params: 212
Trainable params: 212
Non-trainable params: 0
_________________________________________________________________
Episode 50, Total reward 10.0
Episode 100, Total reward 37.0
Episode 150, Total reward 139.0
Episode 200, Total reward 200.0
```
(your mileage may vary...)

___

**Your turn now !**

PS: if your current DQN is really poor (not reaching reward > 10) you can cheat by using:
```
from decent import DQNAgent
```

# DQN/Explore Results


### POLL: who reached a reward > 50 ?

### POLL: who reached a reward > 200 ?

## Case Study

### POLL: which one is DQN ? DQN/Ex ?

`agent.plot_state()`

| | |
|---|--- |
| ![](images/state_dqn_2.png) | ![](images/state_dqne_1.png) |


### POLL: who got the message "task solved" ?


### POLL: who solved in < 200 episodes ?

*hint*: can we optimize gains from past experience ?



## DQN/Ex with Replay

### Idea: correct past model updates with newest/better $Q$ function
  
  <img src="images/replay.png" width=640>
  <center><sub><sup>Image retrieved from http://www.modulabs.co.kr</sup></sub></center>

  - Prioritized Replay - https://arxiv.org/abs/1511.05952 - goes one step further by weighting the sampling
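
The memory itself can be sketched with a bounded `deque` and uniform sampling. The tuple layout matches the skeleton that follows, `(state, action, reward, next_state, done)`; the helper `sample_batch` and the toy integer "states" are assumptions for the illustration.

```python
import random
from collections import deque


def sample_batch(memory, batch_size, rng=random):
    """Draw a uniform mini-batch and regroup it column by column."""
    batch = rng.sample(list(memory), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones


memory = deque(maxlen=5)  # tiny capacity, for illustration only
for t in range(8):
    memory.append((t, t % 2, 1.0, t + 1, False))

# The oldest transitions were evicted once the capacity was reached
print([transition[0] for transition in memory])  # [3, 4, 5, 6, 7]

states, actions, rewards, next_states, dones = sample_batch(memory, batch_size=3)
print(states)  # 3 distinct states drawn from the 5 stored ones
```

Sampling at random breaks the strong correlation between consecutive transitions, and each stored transition can be reused for several updates.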

In [None]:
class DQNAgentWithExplorationAndReplay(DQNAgentWithExploration):
    def __init__(self, observation_space, action_space):
        ...
        # create replay memory using deque
        self.memory = deque(maxlen=???)
        self.batch_size = ???

    def train_model(self, action, state, next_state, reward, done):
        
        # save sample <s,a,r,s'> to the replay memory
        self.memory.append((state, action, reward, next_state, done))
        
        if len(self.memory) >= self.train_start:
            ...

___

**Your turn now !**

*Hint: if you got stuck with $\epsilon$ in the previous step, you can try $\epsilon(t) = 1 / \sqrt{t}$*  
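
The schedule from the hint can be written as a small function. The floor `epsilon_min` is an extra assumption (it keeps a little exploration forever) and is not part of the hint.

```python
import math


def epsilon_schedule(t, epsilon_min=0.01):
    """epsilon(t) = 1 / sqrt(t), clipped from below by epsilon_min."""
    t = max(t, 1)  # avoid a division by zero at t = 0
    return max(epsilon_min, 1.0 / math.sqrt(t))


for t in (1, 4, 100, 1_000_000):
    print(t, epsilon_schedule(t))
```

Here `t` can count either steps or episodes; counting episodes gives a much slower decay.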

# DQN/Ex/Replay Results


### POLL: who got the message "task solved" ?


### POLL: who solved in < 200 episodes ?

## Case Study

### POLL: Have we converged ?

|  `agent.plot_state()`  |  `agent.plot_bellman()`  |
|--|--|
| ![](images/dqnee_state_1.png) | ![](images/dqnee_bellman_1.png)|

In [None]:
agent.epsilon = 0
agent.memory = deque(maxlen=1)
agent.batch_size = 1
agent.train_start = 1
env.run(agent, episodes=200, print_delay=33)

### POLL: Have we converged ?

`agent.plot_diagnostics()`

<img src="images/dqnee_all_1.png" width=640>

# Potential Problems/Solutions (advanced)

In some settings (e.g. in some Atari games) the techniques we used are not enough.

Inspecting Bellman residuals and exploration traces usually provides hints on what to improve.

Let's see some examples...



## Double DQN
  - *assumption*: our Q estimates are too optimistic
  - *solution*: defer model update to avoid big jumps in target
  
  <img src="images/ddqn-th.png" width=480>

  - practically: "freeze" $Q_2$ for several time steps / episodes
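
As a starting point, here is a sketch of the target computation as Double DQN is usually described: the online estimate $Q_1$ selects the next action and the frozen estimate $Q_2$ evaluates it. The function names and the list-based Q-values are assumptions for the example; in Keras, freezing typically amounts to copying weights only every few steps with `target_model.set_weights(model.get_weights())`.

```python
def dqn_target(reward, done, q_frozen_next, gamma=0.99):
    """Plain target: the same estimate both selects and evaluates the next action."""
    return reward if done else reward + gamma * max(q_frozen_next)


def double_dqn_target(reward, done, q_online_next, q_frozen_next, gamma=0.99):
    """Double DQN target: Q_1 (online) picks the action, Q_2 (frozen) scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_frozen_next[best_action]


q_online_next = [0.2, 0.8]  # Q_1(s', .)
q_frozen_next = [0.5, 0.3]  # Q_2(s', .)
print(dqn_target(1.0, False, q_frozen_next, gamma=0.9))                        # 1 + 0.9 * 0.5
print(double_dqn_target(1.0, False, q_online_next, q_frozen_next, gamma=0.9))  # 1 + 0.9 * 0.3
```

Because the action is not chosen by the estimate that scores it, an accidentally over-estimated value is less likely to end up in the target.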

## Dueling DQN
  - *assumption*: action is not so important for many states
  - *solution*: separate $Q$ into state and advantage functions $Q(s,a) = V(s) + A(s,a)$
  
  <img src="images/duel.png" width=240>
  <center><sup><sub>Image retrieved from http://torch.ch/blog/2016/04/30/dueling_dqn.html</sub></sup></center>
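
The way the two streams are recombined can be sketched as follows. Subtracting the mean advantage is a common variant that makes the split between $V$ and $A$ unique; it is an addition to the plain formula $Q(s,a) = V(s) + A(s,a)$ above, and the function name is an assumption for the example.

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, .) into Q(s, .), centring the advantages on zero."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + a - mean_advantage for a in advantages]


print(dueling_q_values(2.0, [1.0, -1.0]))  # [3.0, 1.0]
print(dueling_q_values(2.0, [5.0, 5.0]))   # [2.0, 2.0]: the action does not matter here
```

In a network, `value` would be a single output unit and `advantages` a layer with one unit per action, both fed by shared hidden layers.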

___

You can try to implement one of these as an exercise

# Takeaways

## Basic ideas
  - tune the DL model parameters
  - exploration is necessary for Q-learning
  - debug not only with the model loss

 
## Advanced ideas
  - make exploration more efficient (replay)
  - adapt to task specifics (DDQN, Dueling DQN)

## Next steps
  - [John Schulman: nuts and bolts of RL research](http://rll.berkeley.edu/deeprlcourse/docs/nuts-and-bolts.pdf)
  - try the Atari gyms... with CNNs and GPUs :)
  - learn Policy Gradient methods