In [1]:
!pip install gym

Processing c:\users\casti\appdata\local\pip\cache\wheels\c0\84\61\523b92d88787ae29689b3cc08cf445d8d8186d7fbe1acbf87b\gym-0.17.1-cp37-none-any.whl
Collecting cloudpickle<1.4.0,>=1.2.0
  Downloading cloudpickle-1.3.0-py2.py3-none-any.whl (26 kB)
Collecting pyglet<=1.5.0,>=1.4.0
  Using cached pyglet-1.5.0-py2.py3-none-any.whl (1.0 MB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Building wheels for collected packages: future
  Building wheel for future (setup.py): started
  Building wheel for future (setup.py): finished with status 'done'
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491062 sha256=1fe0622cd5ebd0104870e578ae744ec434330eb0cf15a2f70c91cba74ffacb18
  Stored in directory: c:\users\casti\appdata\local\pip\cache\wheels\56\b0\fe\4410d17b32f1f0c3cf54cdfb2bc04d7b4b8f4ae377e2229ba0
Successfully built future
Installing collected packages: cloudpickle, future, pyglet, gym
Successfully installed cloudpickle-1.3.0 future-0.18.2 gym-0.17.1 pyg

In [2]:
!pip install stable_baselines

Collecting stable_baselines
  Using cached stable_baselines-2.10.0-py3-none-any.whl (248 kB)
Collecting atari-py~=0.2.0; extra == "atari"
  Using cached atari_py-0.2.6-cp37-cp37m-win_amd64.whl (1.8 MB)
Collecting Pillow; extra == "atari"
  Downloading Pillow-7.1.2-cp37-cp37m-win_amd64.whl (2.0 MB)
Installing collected packages: stable-baselines, atari-py, Pillow
Successfully installed Pillow-7.1.2 atari-py-0.2.6 stable-baselines-2.10.0


In [4]:
import tensorflow

In [7]:
!pip freeze | grep stable-baselines
#from stable_baselines.common.vec_env import DummyVecEnv
#from stable_baselines.deepq.policies import MlpPolicy
#from stable_baselines import DQN

The command "grep" is either misspelled or
could not be found.


# Reinforcement Learning

So far we have grouped Machine Learning into Supervised Learning and
Unsupervised Learning. There is a third branch of Machine Learning called
Reinforcement Learning. It is motivated by the way humans are believed to learn,
by **interacting with their environment**.

The goal of Reinforcement Learning is to learn a mapping from situations (states) to actions so as to
**maximize a numerical reward signal**.

https://www.youtube.com/watch?v=kopoLzvh5jY

## How does it work?

We define an **agent** that takes **actions**. Each action yields a **reward** and influences the **state** of the environment.

![title](sar.jpg)
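The loop of state, action and reward can be sketched in a few lines of plain Python. The environment below (`CoinFlipEnv`) is a hypothetical toy invented for illustration and is not part of any library: the agent guesses the next outcome of a coin that tends to repeat itself and earns a reward of 1 for each correct guess.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next outcome of a sticky coin."""

    def __init__(self, n_steps=10, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.state = self.rng.choice([0, 1])  # the state is the last coin outcome
        return self.state

    def step(self, action):
        # the coin repeats its last outcome with probability 0.8
        outcome = self.state if self.rng.random() < 0.8 else 1 - self.state
        reward = 1.0 if action == outcome else 0.0
        self.state = outcome
        self.t += 1
        done = self.t >= self.n_steps
        return self.state, reward, done


def run(env, policy):
    """Agent-environment loop: observe the state, act, receive reward and next state."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


# the agent always guesses "same as last time"
total = run(CoinFlipEnv(), policy=lambda state: state)
print(total)
```

The `reset`/`step` interface is chosen to resemble the one Gym uses later in this notebook, minus the extra `info` return value.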

## Problem Definition

Reinforcement Learning problems are commonly formalized as Markov Decision Processes (MDPs), which are defined by a **state space** *S*, an **action space** *A*, a **state transition function** *P*, a **reward function** *r* and a **discount factor** $\gamma$.

### State Space *S*

$S = \{s_1, s_2, ..., s_n\}$

All n possible states of the environment.

### Action Space *A*

$A = \{a_1, a_2, ..., a_m\}$

All m possible actions of the agent.

### State Transition Function *P*

$P(s', r|s, a)$

The probability distribution of entering state $s'$ and receiving reward *r* after choosing action *a* in state *s*. It defines the dynamics of the MDP.

### Reward Function *r*

$r(s, a)$

The expected reward the agent receives for choosing action *a* in state *s*.

### Discount Factor

$\gamma$

The factor with which future rewards are discounted. Usually a discount factor $\gamma < 1$ is used to indicate that future reward is worth less than current reward.
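As a rough sketch, the five ingredients of an MDP can be written down as plain Python data. The two-state example below (states `cool`/`hot`, actions `slow`/`fast`, and all probabilities and rewards) is made up for illustration only.

```python
import random

states = ["cool", "hot"]    # state space S
actions = ["slow", "fast"]  # action space A
gamma = 0.9                 # discount factor

# P[(s, a)] lists (probability, next_state, reward) triples, i.e. P(s', r | s, a)
P = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    ("hot", "slow"): [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
    ("hot", "fast"): [(1.0, "hot", -10.0)],
}


def expected_reward(s, a):
    """r(s, a): expected immediate reward, derived from P."""
    return sum(p * r for p, _, r in P[(s, a)])


def sample_transition(s, a, rng):
    """Draw (next_state, reward) according to P(s', r | s, a)."""
    outcomes = P[(s, a)]
    weights = [p for p, _, _ in outcomes]
    _, s_next, r = rng.choices(outcomes, weights=weights, k=1)[0]
    return s_next, r


print(expected_reward("cool", "fast"))
print(sample_transition("cool", "fast", random.Random(0)))
```

Storing the reward inside `P` mirrors the joint distribution $P(s', r|s, a)$; the reward function then falls out as an expectation over it.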

## Goal

Find a policy $\pi(a|s)$ that maximizes the expected return $V_\pi(s) = E[G_t|S_t = s]$ for every state $s$, where $G_t$ is the **return**.

#### Return

$G_t = R_{t+1} + \gamma * R_{t+2} + ... + \gamma^{p-1} * R_{t+p}$

The return at time t is the cumulative discounted reward collected over the following p time steps.
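A minimal sketch of how the return can be computed from a list of observed rewards. The helper name `discounted_return` is chosen here for illustration; it works backwards through the list so that each reward is discounted once per step of delay.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards = [R_{t+1}, R_{t+2}, ..., R_{t+p}]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```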

#### Policy

$\pi(a|s)$

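A mapping from each state of the environment to a probability distribution over actions. A deterministic policy is the special case that assigns a single action to every state.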

#### State-Value Function

$V_\pi(s) = E[G_t|S_t=s]$

The state-value function for policy $\pi(a|s)$ is the total expected return from being in state S=s and following the policy $\pi(a|s)$.

#### Action-Value Function

$Q_\pi(s, a) = E[G_t|S_t=s, A_t=a]$

The action-value function for policy $\pi(a|s)$ is the total expected return from being in state S=s, choosing action A=a and thereafter following policy $\pi(a|s)$.

One important property of both value functions is that they satisfy **recursive relationships**. For example:


$Q_\pi(s, a) = E[G_t|S_t=s, A_t=a] = $

$E[R_{t+1} + \gamma*G_{t+1}|S_t=s, A_t=a] = $

$E[R_{t+1}|S_t=s, A_t=a] + \gamma*E[V_\pi(S_{t+1})|S_t=s, A_t=a]$

This is called the **Bellman equation** and is central to Reinforcement Learning.
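Because the Bellman equation expresses a value in terms of other values, it can be used as an update rule and applied repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a made-up one-state problem under a fixed policy: from `start` the agent either stays (reward 0) or reaches the terminal state `end` (reward 1), each with probability 0.5.

```python
GAMMA = 0.5

# Hypothetical MDP under a fixed policy: transitions[s] lists
# (probability, next_state, reward); "end" is terminal with value 0.
transitions = {
    "start": [(0.5, "start", 0.0), (0.5, "end", 1.0)],
}


def evaluate_policy(transitions, gamma, sweeps=100):
    """Apply the Bellman equation as an update rule for a fixed number of sweeps."""
    V = {s: 0.0 for s in transitions}
    V["end"] = 0.0  # terminal state: no future reward
    for _ in range(sweeps):
        for s, outcomes in transitions.items():
            V[s] = sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes)
    return V


V = evaluate_policy(transitions, GAMMA)
print(V["start"])  # approaches 0.5 / (1 - 0.5 * GAMMA) = 2/3
```

Solving $V = 0.5 * \gamma * V + 0.5$ by hand gives the same fixed point, which is a quick way to check the iteration.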

## Summary

We want to find a policy that maximizes the expected return (the sum of discounted future rewards) of the agent.

Let's look at an example.

Assumptions:

- Mouse can go left, right, up and down
- If the mouse finds cheese, either in the upper right corner or in the lower left corner, the environment terminates
- The mouse prefers two blocks of cheese over one block of cheese

![](mouse_grid.png)

So what are state space, action space, state transition function, reward function and discount factor?

## How do we find the policy?

- Dynamic Programming: Does not learn from interaction with the environment; requires full knowledge of the dynamics and the rewards
- Monte Carlo Methods: Learn directly from the environment, but only after the final outcome of an episode is observed
- Temporal Difference Learning: Learn directly from the environment and update estimates from other learned estimates
- Policy Gradients: Represent the policy directly as a parametrized function (e.g. a neural network in Deep Reinforcement Learning) and adjust its parameters along the gradient of the expected return
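To make the Temporal Difference idea concrete, here is a minimal sketch of tabular Q-learning, one well-known TD method. The corridor environment, its size and the hyperparameters are assumptions chosen for illustration; the agent starts at the left end and receives a reward of 1 for reaching the right end.

```python
import random

N_STATES = 5      # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = [0, 1]  # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Deterministic corridor dynamics (hypothetical toy environment)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # purely random behaviour policy
            next_state, reward, done = step(state, action)
            # the TD target bootstraps from the current estimate of the next state
            target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
# greedy action in every non-terminal state
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

Each update uses a single observed transition plus the existing estimate `max(Q[next_state])`, which is what "update estimates from other learned estimates" refers to.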

## Drawbacks

* Credit assignment problem
* Exploration vs. Exploitation
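One common way to balance exploration and exploitation is an epsilon-greedy rule: act randomly with a small probability `epsilon`, otherwise pick the action that currently looks best. The sketch below applies it to a made-up three-armed bandit; the reward means and the Gaussian noise model are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward around the true mean
        counts[arm] += 1
        # incremental update of the sample mean for the chosen arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit(true_means=[0.0, 1.0, 3.0])
print(counts)
```

With `epsilon=0` the agent can lock onto whichever arm it tried first; with a large `epsilon` it keeps wasting pulls on arms it already knows are poor.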

---

### Implement it in practice using OpenAI's Gym
* A handy library for learning about RL - https://gym.openai.com/

`pip install gym`

In [2]:
import gym
import time
import numpy as np

---

### Let's work on the cartpole problem
#### First we make an environment in which the agent can be trained

In [3]:
env = gym.make('CartPole-v1')

In [4]:
env.reset()
for i in range(1000):
    env.render()
    obs, reward, done, _ = env.step(env.action_space.sample()) # take a random action
    time.sleep(0.08)
    if done:
        print(f'We survived {i} steps')
        env.reset()
        break
env.close()

We survived 22 steps


#### Now we implement the agent-environment loop
* Start the process by resetting the environment
* And return an initial observation

In [5]:
initial_obs = env.reset()

In [6]:
initial_obs
# position of cart, velocity of cart, angle of pole, rotation rate of pole

array([-0.0194354 , -0.02698667, -0.00820107,  0.04741225])

\[position of cart, velocity of cart, angle of pole, rotation rate of pole\]

Taking an action returns an observation as well. Here the action is a `step` in a given direction: 0 pushes the cart left and 1 pushes it right

In [7]:
obs, reward, done, _ = env.step(0)  # push cart left
obs, reward, done, _ = env.step(1)  # push cart right

We can already use the `done` boolean to work out if we can stop the loop

In [8]:
obs, reward, done, _

(array([-0.02441494, -0.02676566, -0.0005029 ,  0.0425352 ]), 1.0, False, {})

And use `sample` on the `action_space` to randomly pick an action

In [9]:
random_step = env.action_space.sample()

And `render` the environment to see what our cart is doing

**OK, but we need to build an RL agent. What next?**

First, let's try to build the simplest RL agent:
* If the pole is left, move left
* If the pole is right, move right

In [10]:
def simple_rl(env):
    # reset the environment and get the initial observation
    obs = env.reset()

    # loop until the episode ends (the pole falls or the cart leaves the track)
    for i in range(1000):
        # measure: is the pole angled to the left or to the right?
        # action: if the pole leans left, push the cart left; otherwise push right
        action = 0 if obs[2] < 0 else 1

        obs, reward, done, _ = env.step(action)
        env.render()
        time.sleep(0.08)  # slow down rendering so the video plays at a normal rate
        if done:
            print(f'iterations survived: {i}')
            break

    env.close()  # close the render window even if the loop ran to the end

In [11]:
#benchmark for a dumb rl agent = 50

In [13]:
simple_rl(env)

iterations survived: 38


### Let's look at an evolutionary algorithm

In [14]:
parameters = np.random.rand(4) * 2 - 1

In [15]:
parameters

array([0.2584895 , 0.1784583 , 0.36364698, 0.06803947])

In [16]:
observation = env.reset()
observation

array([-0.01544127,  0.00253193, -0.01656722,  0.01944529])

In [17]:
np.matmul(parameters, observation)

-0.008241133094307159

In [18]:
action = 0 if np.matmul(parameters,observation) < 0 else 1
action

0

In [19]:
def run_episode(env, parameters, range_=200, render=False):  
    observation = env.reset()
    totalreward = 0
    
    for _ in range(range_):
        action = 0 if np.matmul(parameters,observation) < 0 else 1
        observation, reward, done, info = env.step(action)
        totalreward += reward
        if render:
            env.render()
            time.sleep(0.08)
        if done:
            break
            
    env.close()
    return totalreward

In [20]:
run_episode(env, parameters, render=True)

9.0

#### Random Search

In [21]:
re_re = 400
bestparams = None  
bestreward = 0  
for i in range(1000):  
    parameters = np.random.rand(4) * 2 - 1
    reward = run_episode(env,parameters, range_=re_re)
    
    if reward > bestreward:
        bestreward = reward
        bestparams = parameters
        # considered solved if the agent lasts re_re timesteps
        if reward == re_re:
            print(f'{i} episodes required to reach a reward of {re_re}')
            break

22 episodes required to reach a reward of 400


In [22]:
bestparams

array([-0.31692137,  0.69724418,  0.88668155,  0.66132017])

In [23]:
bestreward

400.0

In [24]:
run_episode(env, bestparams, re_re, True)

400.0
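Random search discards everything it has learned each time it draws new parameters. A closely related variation, hill climbing, instead perturbs the best parameters found so far and keeps the change only if the reward improves. The sketch below is not part of the original notebook run: `hill_climb` and `toy_objective` are hypothetical names, and the stand-in objective replaces the CartPole episode so the example runs without Gym.

```python
import random


def hill_climb(evaluate, n_params=4, iterations=200, noise=0.1, seed=0):
    """Keep the best parameters and try small random perturbations of them."""
    rng = random.Random(seed)
    best_params = [rng.uniform(-1, 1) for _ in range(n_params)]
    best_reward = evaluate(best_params)
    for _ in range(iterations):
        candidate = [p + rng.gauss(0.0, noise) for p in best_params]
        reward = evaluate(candidate)
        if reward > best_reward:  # keep the perturbation only if it helps
            best_params, best_reward = candidate, reward
    return best_params, best_reward


def toy_objective(params):
    """Stand-in objective (hypothetical): highest when every parameter equals 0.5."""
    return -sum((p - 0.5) ** 2 for p in params)


params, reward = hill_climb(toy_objective)
print(reward)
```

To try it on CartPole, one could pass something like `lambda p: run_episode(env, p)` as `evaluate`, assuming `env` and `run_episode` from the cells above are available.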

### DQN

In [25]:
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

model = DQN(MlpPolicy, env, verbose=1)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\Users\casti\AppData\Roaming\Python\Python37\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "C:\Users\casti\AppData\Roaming\Python\Python37\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "C:\Users\casti\AppData\Roaming\Python\Python37\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "C:\Users\casti\AppData\Local\Continuum\anaconda3\lib\imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "C:\Users\casti\AppData\Local\Continuum\anaconda3\lib\imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: DLL load failed: The specified module could not be found

TypeError: can only concatenate str (not "list") to str

In [26]:
model.learn(total_timesteps=100000)
model.save("deepq_cartpole")

# del model # remove to demonstrate saving and loading

# model = DQN.load("deepq_cartpole")

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\Users\casti\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-3719cea8be06>", line 1, in <module>
    model.learn(total_timesteps=100000)
NameError: name 'model' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\casti\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2040, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'NameError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\casti\AppData\Roaming\Python\Python37\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflo

NameError: name 'model' is not defined

In [27]:
obs = env.reset()
done = False
i = 0
total_rewards = []
while not done:
    i += 1
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    total_rewards.append(rewards)
    time.sleep(0.08)
    env.render()
print(f'{i} steps')

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\Users\casti\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-27-c73328c4e999>", line 7, in <module>
    action, _states = model.predict(obs)
NameError: name 'model' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\casti\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2040, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'NameError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\casti\AppData\Roaming\Python\Python37\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorfl

NameError: name 'model' is not defined