# SS 2021 SEMINAR 06 Reinforcement Learning in der Sprachtechnologie
## Practice Session: RL Examples in Python

### Announcements

#### Homework
        
* thanks for the received notebooks!

* we will take a closer look at them next week

* clarification: the important thing is to understand the general ML workflow, NOT to build the most sophisticated MODEL (students have different backgrounds)

#### Paper Presentations

* clarification: idea is to CHALLENGE yourself to understand a scientific paper! the requirement is explicitly NOT to totally understand it in detail, but to see how far you can get. You're welcome to pose open questions to the community in the seminar!

* use Two Minute Papers for inspiration

* find a TOPIC that inspires you/catches you!

### A) Paper Presentation

### Title: Depression Detection on Social Media with Reinforcement Learning

#### Link:
[Depression Detection on Social Media with Reinforcement Learning](https://link.springer.com/chapter/10.1007%2F978-3-030-32381-3_49)

#### Summary:

This paper explores the potential of using only the textual information to detect depression based on the content users posted on social media sites.
Since users may post a variety of different kinds of content, only a small number of posts are relevant to the signs and symptoms of depression.
The authors propose a reinforcement learning method that automatically selects the indicator posts from the historical posts of
users. After explaining the developed model, they evaluate it against other state-of-the-art methods.

#### Problem/Task/Question:
Can a reinforcement learning method enhance the use of textual information for depression detection?

For this task, the following sub-problems need to be addressed:
- only a small number of posts are relevant
    - therefore the relevant posts are selected first, and the user is then classified based on that selection
- existing dataset annotations are only available at the user level
    - therefore the parameters are updated only after a user has been classified

#### Solution/Idea/Model/Approach:
![alt text](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-32381-3_49/MediaObjects/489562_1_En_49_Fig2_HTML.png "")

**Overall Structure:**
For each tweet in the historical sequence:
- compute the tweet representation with word embeddings and a one-layer LSTM
- the agent decides on an action, i.e. whether to select the current post or not ({1,0}),
    based on the current state (current post, selected posts and irrelevant posts)
- the action is obtained by sampling from a multinomial distribution

After the selections of one episode have been made:
- The average of the indicator post set is calculated.
- This representation is further processed by a multilayer perceptron with dropout.
- After that, a sigmoid non-linear layer converts the output into a probability distribution.
- Based on this output, the classifier and the agent can be trained.


training:
- The agent is trained using a standard reinforcement learning algorithm called REINFORCE.
- The objective of training the agent is to maximize the expected reward under the distribution of the selection policy.
- REINFORCE approximates the gradient for the agent, while training the classifier is a straightforward supervised classification problem.
- As a result, backpropagation can be applied to both modules.
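
The REINFORCE update mentioned above can be illustrated with a minimal sketch. This is not the paper's model: the policy is reduced to a single hypothetical parameter `theta` whose sigmoid gives the probability of selecting a post, and the reward (1 for selecting, 0 for skipping) is only a stand-in for the classifier-based reward. The point is the update rule: sample an action, then move the parameter along `reward * grad log pi(action)`.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_step(theta, rng, learning_rate=0.5):
    """One REINFORCE update for a Bernoulli policy with pi(select) = sigmoid(theta)."""
    p_select = sigmoid(theta)
    action = 1 if rng.random() < p_select else 0  # sample the action from the policy
    reward = 1.0 if action == 1 else 0.0          # toy reward (assumption, not the paper's)
    grad_log_prob = action - p_select             # d/dtheta of log pi(action)
    return theta + learning_rate * reward * grad_log_prob


rng = random.Random(0)
theta = 0.0
for _ in range(2000):
    theta = reinforce_step(theta, rng)

p_final = sigmoid(theta)
print(f"probability of selecting after training: {p_final:.3f}")
```

Because only the rewarded action pushes `theta`, the selection probability drifts towards 1 in this toy setting. In the paper the reward comes from the classifier, so the agent learns which posts to keep instead.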

dataset:
- 1,402 depressed users and 5,160 non-depressed users with 4,245,727 tweets
- users were labeled as depressed if their anchor tweets satisfied the strict pattern “(I’m/ I was/ I am/ I’ve been) diagnosed with depression”
- users were labeled as non-depressed if they had never posted any tweet containing the character string “depress”

**reinforcement learning spaces:**

**state:**
- current post
- average pooling of selected posts
- average pooling of irrelevant posts
- state space: post x avg(selected posts) x avg(irrelevant posts)

**action:**
- decision whether a post is selected (relevant)
- action space: {1,0}
- policy: select the most likely relevant posts, so that the reward from the classifier increases

**reward**
- the likelihood of the ground truth after finishing all the selections of the i-th user
- to encourage the model to discard more posts, a regularization term that limits the number of selected posts is added

**transition**
- load the next post and add the previous post to the selected or the irrelevant group (then recompute the averages that make up the state)
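
The state, action and transition described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the post representations are hypothetical 2-dimensional vectors (the paper uses LSTM outputs), and `policy` is a hand-written rule standing in for the learned agent.

```python
def mean_vector(vectors, dim):
    """Average pooling; returns a zero vector while the group is still empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_state(current_post, selected, irrelevant):
    """State = current post ++ avg(selected posts) ++ avg(irrelevant posts)."""
    dim = len(current_post)
    return current_post + mean_vector(selected, dim) + mean_vector(irrelevant, dim)


def run_episode(posts, policy):
    """Walk through one user's posts and sort each one into selected / irrelevant."""
    selected, irrelevant = [], []
    for post in posts:
        state = build_state(post, selected, irrelevant)
        action = policy(state)  # 1 = select, 0 = do not select
        (selected if action == 1 else irrelevant).append(post)  # transition
    return selected, irrelevant


# Hypothetical post vectors and a hand-written stand-in policy.
posts = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
policy = lambda state: 1 if state[0] > 0.5 else 0

selected, irrelevant = run_episode(posts, policy)
print("selected:  ", selected)
print("irrelevant:", irrelevant)
print("next state:", build_state([0.0, 0.0], selected, irrelevant))
```

After the last post, the average of `selected` is what would be handed to the classifier, and the classifier's output would provide the reward for the whole episode.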


#### Results:
- CNN/LSTM + RL methods achieve the best performance, with an F1-measure of more than 87%

**Utility of selected posts:**
- the authors compared the baseline models with three settings: original dataset, selected dataset and unchosen dataset
- on the selected dataset the models can achieve almost 2.4% better scores and the error reduction rate was more than 9%

![alt text](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-32381-3_49/MediaObjects/489562_1_En_49_Fig3_HTML.png)

**Robustness Analysis in Realistic Scenarios:**

- the authors additionally performed analysis for imbalanced data sets and additional noisy posts

imbalanced data sets:

![alt text](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-32381-3_49/MediaObjects/489562_1_En_49_Fig4_HTML.png)

the method achieved a stable and outstanding performance even though there is only a very low proportion of users with depression

additional noisy posts:

![alt text](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-32381-3_49/MediaObjects/489562_1_En_49_Fig5_HTML.png)

- the performance of the models decreased to various degrees
- however, as the number of posts increased, so did the advantage of the proposed model
- at the 90 point, the proposed model outperforms the attention-based model by over 13% in F1 score

- the noise consists of posts from an unlabelled depression-candidate dataset
- users are included in this dataset if their anchor tweets loosely contained the character string “depress”

#### Critical Discussion:

* **+** clear structure and visualisation made understanding relatively easy
* **+** the proposed method shows the integration of RL methods in larger natural language processing systems

* the paper addresses readers with a background in NLP.
To understand the functions used in detail, the reader must be quite familiar with topics like word embeddings, neural networks and reinforcement learning


* **-** the labels of the used dataset are constructed via a strict pattern, which might not depict the ground truth
* **-** the noise data would contain more depressed users than random sampling would, so the distribution of the noise data is not known.
Because of that, the evaluation on the noise data could contain an unnatural bias

***

### Discussion



***

## B) Practice Session

### RL Examples 

Before starting with concrete examples, let us briefly recap the important RL concepts that we need to consider and then see how they apply in the examples. 


**Concepts we considered:** 

Environment

State Space

Reward (function)

Return

Markov Processes

Markov Decision Processes

Transition Function

Reward (function)

Value Functions

State-Value-Function

Action-Value Function

Agent

Action Space

Observation Space

Policy

***
<br>
<br>

Question: Why do we need **Markov Decision Processes**?

Markov Process (MP): (Transition Function, State Space, Init State)

# [A|B|C]

<img src="https://miro.medium.com/max/437/1*SUUir-VGHy2OFqbpKxwuJA.png" width="500"/>
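
A Markov process can be sampled directly from its tuple. The sketch below assumes three states A, B and C with made-up transition probabilities (the numbers are illustrative and not read off the figure); the next state is drawn using only the current state, which is the Markov property.

```python
import random

# Hypothetical transition function over the states A, B, C (each row sums to 1).
transitions = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"A": 0.4, "B": 0.2, "C": 0.4},
    "C": {"A": 0.5, "B": 0.5, "C": 0.0},
}


def sample_chain(init_state, num_steps, rng):
    """Sample a state sequence; the next state depends only on the current one."""
    state, path = init_state, [init_state]
    for _ in range(num_steps):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights)[0]
        path.append(state)
    return path


path = sample_chain("A", 10, random.Random(0))
print(" -> ".join(path))
```

Note that nothing here can be influenced: there are no actions and no rewards, so there is nothing to optimise yet.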

<br>
<br>

Markov Decision Process (MDP): (Transition Function, State Space, Init State, Reward Function, Action Space)

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Markov_Decision_Process.svg/400px-Markov_Decision_Process.svg.png" width="500"/>

<img src="https://i1.wp.com/neptune.ai/wp-content/uploads/Q-table.png?resize=1024%2C448&ssl=1" width="500"/>

* If someone gives me this tuple, Richard BELLMAN proved that I can solve any finite MDP **OPTIMALLY** using **Policy Iteration or Value Iteration**, i.e. by iteratively improving **VALUE FUNCTIONS** (state-value function, state-action-value function)

* This gives us the foundation to be sure that building a **STATE-VALUE FUNCTION** based on RETURN is actually a valid approach for finding an optimal POLICY

<img src="http://oneraynyday.github.io/assets/mdparrow.png" width="500"/>

* In the model-free examples below we build a **STATE-ACTION-VALUE FUNCTION** (Q-function) first; the STATE-VALUE FUNCTION of the greedy policy then follows as V(s) = max_a Q(s, a). 

<img src="https://miro.medium.com/max/384/1*RhO9Ulh5nF_zc2pcgBHzAg.png" width="500"/>
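
To make the relation between the MDP tuple, the value functions and the policy concrete, here is a small value iteration sketch. The two-state MDP (`s0`, `s1`, actions `stay` and `go`) and the discount factor are assumptions chosen so that the result can be checked by hand; they do not correspond to any figure above.

```python
GAMMA = 0.9  # discount factor (assumption)

# mdp[state][action] = list of (probability, next_state, reward)
mdp = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 1.0)], "go": [(1.0, "s0", 0.0)]},
}


def q_value(values, state, action):
    """Expected return of taking `action` in `state` and then following `values`."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in mdp[state][action])


def value_iteration(tolerance=1e-9):
    """Apply the Bellman optimality backup until the values stop changing."""
    values = {s: 0.0 for s in mdp}
    while True:
        new_values = {s: max(q_value(values, s, a) for a in mdp[s]) for s in mdp}
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tolerance:
            return values


values = value_iteration()
q_table = {s: {a: q_value(values, s, a) for a in mdp[s]} for s in mdp}
policy = {s: max(q_table[s], key=q_table[s].get) for s in mdp}

print("state values:", {s: round(v, 3) for s, v in values.items()})
print("greedy policy:", policy)
```

By hand: staying in `s1` forever is worth 1 / (1 - 0.9) = 10, and `s0` is one discounted step away, so it is worth 0.9 * 10 = 9. The greedy policy with respect to the Q-values is therefore `go` in `s0` and `stay` in `s1`.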


In the following, we take a detailed look at concrete examples at increasing levels of granularity: we start with simple text-based scenarios, continue with table-based ones and end with concrete implementations in Python code.

#### Examples I 


***
>Problem: You’re training your dog to sit. 

> Agent: The part of your brain that makes decisions. 

> Environment: Your dog, the treats, your dog’s paws, the loud neighbor,
and so on. 

>Actions: Talk to your dog. Wait for dog’s reaction. Move your hand. Show
treat. Give treat. Pet. 

>Observations: Your dog is paying attention to you. Your dog is
getting tired. Your dog is going away. Your dog sat on command.
***

***
>Problem: Your dog wants the treats you have. 

>Agent: The part of your dog’s brain
that makes decisions. 

>Environment: You, the treats, your dog’s paws, the loud neighbor, and so on. 

>Actions: Stare at owner. Bark. Jump at owner. Try to steal the
treat. Run. Sit. 

>Observations: Owner keeps talking loud at dog. Owner is showing
the treat. Owner is hiding the treat. Owner gave the dog the treat.
***

***
>Problem: A trading agent investing in the stock market. 

>Agent: The executing DRL
code in memory and in the CPU. 

>Environment: Your internet connection, the
machine the code is running on, the stock prices, the geopolitical uncertainty, other
investors, day traders, and so on. 

>Actions: Sell n stocks of y company. Buy n stocks of y company. Hold. 

>Observations: Market is going up. Market is going down. There
are economic tensions between two powerful nations. There’s danger of war in the continent. A global pandemic is wreaking havoc in the entire world.
***

***
> Problem: You’re driving your car. 

>Agent: The part of your brain that makes
decisions. 

>Environment: The make and model of your car, other cars, other drivers,
the weather, the roads, the tires, and so on. 

>Actions: Steer by x. Accelerate by y. Brake by z. Turn the headlights on. Defog windows. Play music. 

>Observations: You’re approaching your destination. There’s a traffic jam on Main Street. The car next to you is driving.
***
***
#### Examples II

***
Game: Hot or Cold

<img src="https://assets.codepen.io/237169/internal/screenshots/pens/OVpbBa.default.png?fit=cover&format=auto&ha=false&height=540&quality=75&v=2&version=1435097797&width=960" width="500"/>

Problem: Guess a randomly selected number using hints.

Observation Space: Int range 0–3. 0 means no guess yet submitted, 1 means guess is lower than the target, 2 means guess is equal to the target, and 3 means guess is higher than the target.

Sample Observation: 2

Action Space: Float from –2000.0–2000.0. The float number the agent is guessing.

Sample Action: –909.37

Reward Function: The reward is the negative log of the distance the agent has guessed toward the target.
***

***
Game: Cart Pole

<img src="https://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/CartPole-v1/poster.jpg" width="500"/>

Problem: Balance a pole on a cart.

Observation Space: A four-element vector with ranges from [–4.8, –Inf, –0.418, –Inf] to [4.8, Inf, 0.418, Inf]. First element is the cart position, second is the cart velocity, third is the pole angle in radians, fourth is the pole velocity at the tip.

Sample Observation: [–0.16, –1.61, 0.17, 2.44]

Action Space: Int range 0–1. 0 means push cart left, 1 means push cart right.

Sample Action: 0

Reward Function: The reward is 1 for every step taken, including the termination step.
***

***
Game: Lunar Lander

<img src="https://miro.medium.com/max/1346/1*i7lxpgt2K3Q8lgEPJu3_xA.png" width="500"/>

Problem: Navigate a lander to its landing pad.

Observation Space: An eight-element vector with ranges from [–Inf, –Inf, –Inf, –Inf, –Inf, –Inf, 0, 0] to [Inf, Inf, Inf, Inf, Inf, Inf, 1, 1]. First element is the x position, the second the y position, the third is the x velocity, the fourth is the y velocity, fifth is the vehicle’s angle, sixth is the angular velocity, and the last two values are Booleans indicating legs contact with the ground.

Sample Observation: [0.36, 0.23, –0.63, –0.10, –0.97, –1.73, 1.0, 0.0]

Action Space: Int range 0–3. No-op (do nothing), fire left engine, fire main engine, fire right engine.

Sample Action: 2

Reward Function: Reward for landing is 200. There’s a reward for moving from the top to the landing pad, for crashing or coming to rest, for each leg touching the ground, and for firing the engines.
***

***
Game: Pong

<img src="https://www.signalpop.com/wp-content/uploads/2018/10/pong-1.png" width="500"/>

Problem: Bounce the ball past the opponent, and avoid letting the ball pass you.

Observation Space: A tensor of shape 210, 160, 3. Values ranging 0–255. Represents a game screen image.

Sample Observation: [[[246, 217, 64], [55, 184, 230], [46, 231, 179], ..., [28, 104, 249], [25, 5, 22], [173, 186, 1]], ...]

Action Space: Int range 0–5. Action 0 is No-op, 1 is Fire, 2 is Up, 3 is Right, 4 is Left, 5 is Down. Notice how some actions don’t affect the game in any way. In reality the paddle can only move up or down, or not move.

Sample Action: 3

Reward Function: The reward is a 1 when the ball goes beyond the opponent, and a –1 when your agent’s paddle misses the ball.
***

***
Game: Humanoid Robot

<img src="https://images.livemint.com/img/2021/01/22/1140x641/2020-12-30T104653Z_346921282_RC2MXK994162_RTRMADP_3_TECH-ROBOTS-BOSTON-DYNAMICS-DANCE_1611299330093_1611299392327.JPG" width="500"/>

Problem: Make robot run as fast as possible and not fall.

Observation Space: A 44-element (or more, depending on the implementation) vector. Values ranging from –Inf to Inf. Represents the positions and velocities of the robot’s joints.

Sample Observation: [0.6, 0.08, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.045, 0.0, 0.47, ..., 0.32, 0.0, –0.22, ..., 0.]

Action Space: A 17-element vector. Values ranging from –Inf to Inf. Represents the forces to apply to the robot’s joints.

Sample Action: [–0.9, –0.06, 0.6, 0.6, 0.6, –0.06, –0.4, –0.9, 0.5, –0.2, 0.7, –0.9, 0.4, –0.8, –0.1, 0.8, –0.03]

Reward Function: The reward is calculated based on forward motion with a small penalty to encourage a natural gait.
***

<br>

**WHERE ARE THE TRANSITION FUNCTIONS?**: 

The transition functions are the **PHYSICS/RULES OF THE GAME**. That means that to take a look at the transition functions, we would have to look at the underlying code of the games. 

For the cart pole environment, this would be e.g. a small Python file that defines the mass of the cart and the pole and implements the basic physics equations, so that the environment transitions at each time step according to them. 
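
As an illustration of such a transition function, here is a minimal cart-pole step. It is a sketch modeled on the classic cart-pole equations, not the Gym source itself: the constants (masses, pole length, force, time step) and the simple Euler integration are assumptions made for this example.

```python
import math

# Constants for the sketch (assumed values).
GRAVITY, MASS_CART, MASS_POLE = 9.8, 1.0, 0.1
HALF_POLE_LENGTH, FORCE_MAG, TIME_STEP = 0.5, 10.0, 0.02


def transition(state, action):
    """Map (state, action) to the next state; action 0 pushes left, 1 pushes right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_POLE_LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # accelerations of the pole angle and the cart position
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    # explicit Euler integration over one time step
    return (
        x + TIME_STEP * x_dot,
        x_dot + TIME_STEP * x_acc,
        theta + TIME_STEP * theta_dot,
        theta_dot + TIME_STEP * theta_acc,
    )


state = (0.0, 0.0, 0.0, 0.0)
next_state = transition(state, action=1)
print(next_state)
```

Pushing the cart to the right from rest gives the cart a positive velocity and the pole a negative angular velocity. The agent never sees these equations, it only observes the resulting states.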

***


In [None]:
# RL in LANGUAGE TASKS

# LANGUAGE GAMES that can be modeled with RL: 
# Translation Agent: see sentence -> action space is the vocabulary -> reward based on the appropriateness of the translation. 
# Language Learning: environment with sounds, words, images -> reward from teacher?
# Google Assistant/Personal Assistant: user asks question -> assistant gives answer -> QA-Dialogue!!!
# Information Retrieval: explanation creation -> SCORING => adaptive explanations/generations
# Human Language Learning: producing sounds, BIAS(no,..), GOALS (achieve common ground, understanding, does he/she understand me?), 

# FUTURE SESSION: language emergence, how does language emerge in an environment -> why and how do agents exchange sounds to ACHIEVE THINGS/GOALS?!!! (deepmind group)

# ACTION SPACE: [VOCABULARY]
# ACTION: [] 

## C) Practice

#### Examples III

Game: Frozen Lake

Today I am using the Gym environments from OpenAI. 
You can install Gym this way: 

`conda install -c conda-forge gym`

The game we try to solve is the frozen lake game:
<img src="https://camo.githubusercontent.com/f558f268f3c1a45f0a88342113476f34ce894896c30b66fdc3101c8d090a0a0a/68747470733a2f2f616e616c7974696373696e6469616d61672e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f30332f46726f7a656e2d4c616b652e706e67" width="500"/>

For this we use the openai gym environment implementation from here: 
https://gym.openai.com/envs/FrozenLake-v0/

You can find the source code in this repository:
https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py

The environment is an implementation of the same interface as we implemented last week. 

Let's start:

-----

In [43]:
# Check GPU reachability 
import torch
print(torch.__version__)
print(torch.cuda.is_available())

1.7.1
True


In [44]:
import numpy as np
import gym 
import random 
import time
from abc import ABC, abstractmethod
from IPython.display import clear_output

In [45]:
# get an instance of the frozen lake gym environment
env = gym.make("FrozenLake-v0")

In [46]:
# get some information from the environment
action_space = env.action_space
#print(action_space)
action_space_size = env.action_space.n
#print(action_space_size)
state_space = env.observation_space 
print(state_space)
state_space_size = env.observation_space.n
print(state_space_size)

Discrete(16)
16


In [47]:
# Q-TABLE
# build our action-value table | Q-TABLE
# as you already know, the q-table looks like this
# state | action_space

q_table = np.zeros((state_space_size, action_space_size))
#print(q_table)

In [48]:
# Training Parameters
num_episodes = 10000
max_steps_per_episode = 100
num_test_episodes = 3     

# q-learning | update parameters
learning_rate = 0.1
discount_rate = 0.99

# exploration-exploitation trade off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
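
To get a feeling for the exploration parameters above, the following sketch evaluates the exponential decay schedule used in the training loop at a few episode numbers. The helper name `exploration_rate_at` is introduced only for this illustration.

```python
import math

max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


def exploration_rate_at(episode):
    """Exponential decay from max_exploration_rate towards min_exploration_rate."""
    span = max_exploration_rate - min_exploration_rate
    return min_exploration_rate + span * math.exp(-exploration_decay_rate * episode)


for episode in (0, 1000, 5000, 10000):
    print(episode, round(exploration_rate_at(episode), 3))
```

With these values the agent still explores in roughly 37% of the steps after 1000 episodes and is close to the minimum of 1% by the end of training.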

In [49]:
# Q-LEARNING

# collect rewards somewhere to visualize our learning curve
rewards_of_all_episodes = []

# Training Loop
for episode in range(num_episodes):
    # reset/initialize the environment first
    state = env.reset()
    # set done back to false at the beginning of an episode
    done = False
    # reset our rewards collector | return for the beginning episode
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        # select an action
        # use our exploration exploitation trade off -> do we explore or exploit in this timestep ?
        exploration_rate_threshold = random.uniform(0,1)
        if(exploration_rate_threshold > exploration_rate):
            action = np.argmax(q_table[state, : ])
        else:
            action = env.action_space.sample()
            
        new_state, reward, done, info = env.step(action)
        
        # Update Q-Table Q(s,a) using the bellman update  
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, : ]))
        
        # update the state to the new state
        state = new_state
        # collect the reward
        rewards_current_episode += reward
        
        if done:
            break
    
    # after we finish an episode, make sure to update the exploration rate
    # decay the exploration rate the longer the time goes on
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    # append our rewards for this episode for learning curve
    rewards_of_all_episodes.append(rewards_current_episode)
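
The Q-table update inside the loop can be checked by hand on single transitions. The sketch below isolates the update rule in a hypothetical helper `q_update` (same learning rate and discount rate as above) and applies it to two made-up situations.

```python
def q_update(q_old, reward, max_q_next, learning_rate=0.1, discount_rate=0.99):
    """Blend the old estimate with the one-step target reward + gamma * max_a' Q(s', a')."""
    target = reward + discount_rate * max_q_next
    return (1 - learning_rate) * q_old + learning_rate * target


# Stepping onto the goal tile: reward 1, the next state's Q-values are still 0.
goal_step = q_update(q_old=0.0, reward=1.0, max_q_next=0.0)

# One tile earlier: no reward yet, but the next state is already worth 0.1.
earlier_step = q_update(q_old=0.0, reward=0.0, max_q_next=0.1)

print(goal_step, earlier_step)
```

The first update moves the estimate a tenth of the way towards the target 1. The second shows how value flows backwards from the goal: a state that leads to a valuable state gets a (discounted) share of that value even without an immediate reward.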

In [50]:
# LEARNING STATISTICS 
# print the average reward for each block of 1000 episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_of_all_episodes), num_episodes // 1000)
count = 1000
print('*****INFO: average reward per thousand episodes: ***** \n')
for reward in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(reward/1000)))
    count += 1000
    
# print our learned q-table
print("\n\n ***** Q-TABLE ***** \n")
print(q_table)

*****INFO: average reward per thousand episodes: ***** 

1000 :  0.05200000000000004
2000 :  0.22600000000000017
3000 :  0.3890000000000003
4000 :  0.5600000000000004
5000 :  0.6200000000000004
6000 :  0.6820000000000005
7000 :  0.6960000000000005
8000 :  0.7160000000000005
9000 :  0.7000000000000005
10000 :  0.7000000000000005


 ***** Q-TABLE ***** 

[[0.60250763 0.54077859 0.53225512 0.5417312 ]
 [0.43348457 0.38128151 0.32437287 0.52260515]
 [0.44849661 0.40866906 0.36844345 0.4796271 ]
 [0.21224305 0.34816654 0.23593559 0.46583193]
 [0.61740758 0.42681705 0.35133093 0.39854066]
 [0.         0.         0.         0.        ]
 [0.1831593  0.19293863 0.37993096 0.07587284]
 [0.         0.         0.         0.        ]
 [0.44894363 0.50043263 0.38555528 0.65065883]
 [0.59576449 0.71645349 0.346324   0.34906457]
 [0.66089752 0.33333752 0.4307606  0.32201233]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.46242914 0.59221847 0.79649711
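
A learned Q-table is easier to read as a policy. The sketch below takes the first four rows of the output above (states 0 to 3, i.e. the top row of the grid) and picks the greedy action per state. The action names assume the FrozenLake ordering 0 = Left, 1 = Down, 2 = Right, 3 = Up.

```python
ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # assumed FrozenLake action order

# First four rows of the printed Q-table (states 0-3).
q_rows = [
    [0.60250763, 0.54077859, 0.53225512, 0.5417312],
    [0.43348457, 0.38128151, 0.32437287, 0.52260515],
    [0.44849661, 0.40866906, 0.36844345, 0.4796271],
    [0.21224305, 0.34816654, 0.23593559, 0.46583193],
]


def greedy_action(q_row):
    """Index of the action with the highest Q-value in this state."""
    return max(range(len(q_row)), key=lambda action: q_row[action])


greedy_policy = [ACTION_NAMES[greedy_action(row)] for row in q_rows]
state_values = [max(row) for row in q_rows]  # V(s) = max_a Q(s, a)

print(greedy_policy)
print(state_values)
```

That the agent prefers Left or Up in the top row looks odd at first. A plausible explanation is the slippery ice: on the default slippery lake the agent often slides sideways, so moving towards a wall can be the safest way to avoid the holes.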

In [51]:
# EVALUATION | TESTING | watching our agent play
for episode in range(num_test_episodes):
    state = env.reset()
    done = False
    print("INFO:*****EPISODE ", episode+1, "\n\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        action = np.argmax(q_table[state, :])
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                print("INFO: ***** agent reached the goal. *****")
                time.sleep(3)
            else:
                print("INFO: ***** agent died.")
                time.sleep(3)
            clear_output(wait=True)
            break
        
        state = new_state

env.close()

  (Down)
SFFF
FHFH
FFFH
HFFG
INFO: ***** agent reached the goal. *****


***
***
***

Game: Taxi 

Find the OpenAI Gym Environment here: https://gym.openai.com/envs/Taxi-v3/

### ENVIRONMENT

In [52]:
# DOMAIN = the data_object (JSON SERIALIZABLE)
class Environment():
    """The main Environment class. It encapsulates an environment with
    arbitrary behind-the-scenes dynamics. An environment can be
    partially or fully observed.
    The main API methods that users of this class need to know are:
        step
        reset
        render
        close
        seed
    And set the following attributes:
        action_space: The Space object corresponding to valid actions
        observation_space: The Space object corresponding to valid observations
        reward_range: A tuple corresponding to the min and max possible rewards
    Note: a default reward range set to [-inf,+inf] already exists. Set it if you want a narrower range.
    The methods are accessed publicly as "step", "reset", etc...
    """
    # Set this in SOME subclasses
    metadata = {'render.modes': []}
    reward_range = (-float('inf'), float('inf'))
    spec = None

    def __init__(self, action_space=None, observation_space=None):
        # Set variables
        self.action_space = action_space
        self.observation_space = observation_space
    
# REPOSITORY = the functionality interface
class EnvironmentRepository(ABC):
    @abstractmethod
    def step(self, action):
        """Run one timestep of the environment's dynamics. When end of
        episode is reached, you are responsible for calling `reset()`
        to reset this environment's state.
        Accepts an action and returns a tuple (observation, reward, done, info).
        Args:
            action (object): an action provided by the agent
        Returns:
            observation (object): agent's observation of the current environment
            reward (float) : amount of reward returned after previous action
            done (bool): whether the episode has ended, in which case further step() calls will return undefined results
            info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
        """
        pass

    @abstractmethod
    def reset(self):
        """Resets the environment to an initial state and returns an initial
        observation.
        Note that this function should not reset the environment's random
        number generator(s); random variables in the environment's state should
        be sampled independently between multiple calls to `reset()`. In other
        words, each call of `reset()` should yield an environment suitable for
        a new episode, independent of previous episodes.
        Returns:
            observation (object): the initial observation.
        """
        pass

    @abstractmethod
    def render(self, mode='human'):
        """Renders the environment.
        The set of supported modes varies per environment. (And some
        environments do not support rendering at all.) By convention,
        if mode is:
        - human: render to the current display or terminal and
          return nothing. Usually for human consumption.
        - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
          representing RGB values for an x-by-y pixel image, suitable
          for turning into a video.
        - ansi: Return a string (str) or StringIO.StringIO containing a
          terminal-style text representation. The text can include newlines
          and ANSI escape sequences (e.g. for colors).
        Note:
            Make sure that your class's metadata 'render.modes' key includes
              the list of supported modes. It's recommended to call super()
              in implementations to use the functionality of this method.
        Args:
            mode (str): the mode to render with
        Example:
        class MyEnv(Env):
            metadata = {'render.modes': ['human', 'rgb_array']}
            def render(self, mode='human'):
                if mode == 'rgb_array':
                    return np.array(...) # return RGB frame suitable for video
                elif mode == 'human':
                    ... # pop up a window and render
                else:
                    # just raise an exception
                    super(MyEnv, self).render(mode=mode)
        """
        pass

    @abstractmethod
    def close(self):
        """Override close in your subclass to perform any necessary cleanup.
        Environments will automatically close() themselves when
        garbage collected or when the program exits.
        """
        pass


In [53]:
# REPOSITORY IMPLEMENTATION = the way how you would like to implement it
class EnvironmentRepositoryImpl(EnvironmentRepository):
    # Initialize / Instance Attributes
    def __init__(self, environment):
        # Set variables
        self.data_object = environment
        print('Environment initialized')

    def step(self, action):
        state = self.data_object.step(action)
        return state

    def reset(self):
        state = self.data_object.reset()
        return state

    def render(self, mode='human'):
        # forward the requested render mode to the wrapped environment
        return self.data_object.render(mode=mode)

    def close(self):
        self.data_object.close()

    def get_action_space(self):
        # get action space from api of the playground or via js in browser using selenium
        action_space = self.data_object.action_space
        return action_space

    def get_observation_space(self):
        # get observation space of the playground from api or via js in browser using selenium
        observation_space = self.data_object.observation_space
        return observation_space
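
To see what an object behind this interface has to provide, here is a minimal toy environment that needs no Gym installation. `CorridorEnv` is a hypothetical example, not part of Gym: it only offers `reset` and `step`, with `step` returning the same `(observation, reward, done, info)` tuple as described in the docstrings above.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the start and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """action 0 = left, 1 = right; returns (observation, reward, done, info)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


env = CorridorEnv(length=5)
observation = env.reset()
trajectory = [observation]
done = False
while not done:
    observation, reward, done, info = env.step(1)  # always move right
    trajectory.append(observation)

print(trajectory, reward)
```

The transition function here is the single line that updates `self.position`; everything else is bookkeeping for the interface.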

### AGENT

In [54]:
# DOMAIN
class Agent():
    # class variables
    agent_variable = ""
    # class methods
    def __init__(self, agent_variable=""):
        self.agent_variable = agent_variable
        
# REPOSITORY
class AgentRepository(ABC):
    @abstractmethod
    def get_action(self, state): # This is the agent's POLICY, if you will
        """ Agent gets a state as input and returns an action 
        """
        pass

In [77]:
# REPOSITORY IMPLEMENTATION
class AgentRepositoryImpl(AgentRepository):
    # Initializer / Instance Attributes
    def __init__(self, environment, agent):
        # Set variables
        self.data_object = agent
        self.environment = environment
        self.action_space = self.environment.action_space
        self.observation_space = self.environment.observation_space
        self.action_space_size = self.environment.action_space.n
        self.observation_space_size = self.environment.observation_space.n
        # INIT Q-TABLE
        self.q_table = np.zeros((self.observation_space_size, self.action_space_size))
        # INIT AGENT PARAMETERS
        self.learning_rate = 0.7           # Learning rate
        self.discount_rate = 0.618         # Discounting rate
        self.exploration_rate = 1.0        # Exploration rate
        self.max_exploration_rate = 1.0    # Exploration probability at start
        self.min_exploration_rate = 0.01   # Minimum exploration probability 
        self.exploration_decay_rate = 0.01 # Exponential decay rate for exploration probability
        print('Agent initialized.')

    def get_action(self, state):
        #EXPLORATION-EXPLOITATION TRADE OFF
        exploration_rate_threshold = random.uniform(0,1)
        if(exploration_rate_threshold > self.exploration_rate):
            # get action from q table
            action = np.argmax(self.q_table[state, : ])
        else:
            # get random action
            action = self.get_random_action()
        return action
    
    def get_random_action(self):
        #action_set = random.sample(self.action_space, 1)
        #action = action_set[0]
        action = self.action_space.sample()
        return action
    
    def update_q_table(self, state, action, reward, new_state):
        self.q_table[state, action] = self.q_table[state, action] * (1 - self.learning_rate) + self.learning_rate * (reward + self.discount_rate * np.max(self.q_table[new_state, : ]))
        
    def update_exploration_rate(self, episode_num):
        self.exploration_rate = self.min_exploration_rate + (self.max_exploration_rate - self.min_exploration_rate) * np.exp(-self.exploration_decay_rate*episode_num)
    
    def get_exploit_action(self, state):
        action = np.argmax(self.q_table[state, : ])
        return action

### TRAINING

In [78]:
# Training Parameters
num_episodes = 50000        # Total episodes
max_steps_per_episode = 100 # Max steps per episode
num_test_episodes = 5     # Total test episodes

In [83]:
# Setting up the Environment

# get the environment
env = gym.make("Taxi-v3")

# get some information from the environment
action_space = env.action_space
print(action_space)
action_space_size = env.action_space.n
print(action_space_size)
observation_space = env.observation_space 
print(observation_space)
observation_space_size = env.observation_space.n
print(observation_space_size)
reward_range = env.reward_range
print(reward_range)

environment_data_object = env #Environment(action_space, observation_space)
environment = EnvironmentRepositoryImpl(environment_data_object)

# Setting up the Agent
agent_data_object = Agent()
agent = AgentRepositoryImpl(environment_data_object, agent_data_object)

Discrete(6)
6
Discrete(500)
500
(-inf, inf)
Environment initialized
Agent initialized.
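
Where do the 500 states come from? Assuming the usual Taxi layout (a 5x5 grid, five passenger locations including "in the taxi", four destinations), every combination is packed into one integer. The sketch below shows such a mixed-radix encoding; as far as I know it mirrors what the Gym implementation does, but treat the helper names and details as illustrative.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Map (row, col, passenger location, destination) to one integer in 0..499."""
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_location  # 0-3: waiting at a location, 4: in the taxi
    index = index * 4 + destination
    return index


def decode(index):
    """Inverse of encode: recover the four components from the state number."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


num_states = 5 * 5 * 5 * 4
print(num_states)
print(encode(3, 1, 2, 0), decode(encode(3, 1, 2, 0)))
```

This is why a plain Q-table with 500 rows and 6 columns is enough for Taxi: the observation is just a row index into that table.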


In [80]:
# TRAINING
# collect rewards somewhere to visualize our learning curve
rewards_of_all_episodes = []

# Training Loop
for episode in range(num_episodes):
    # reset/initialize the environment first
    state = environment.reset()
    # set done back to false at the beginning of an episode
    done = False
    # reset the accumulated reward (the return) for the new episode
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        # select an action
        # use our exploration exploitation trade off -> do we explore or exploit in this timestep ?
        action = agent.get_action(state)
            
        new_state, reward, done, info = environment.step(action)
        
        # update the Q-table entry Q(s, a) using the Bellman update
        agent.update_q_table(state, action, reward, new_state)
        
        # update the state to the new state
        state = new_state
        # collect the reward
        rewards_current_episode += reward
        
        if done:
            break
    
    # after each episode, decay the exploration rate
    # so that the agent explores less as training progresses
    agent.update_exploration_rate(episode)
    # append our rewards for this episode for learning curve
    rewards_of_all_episodes.append(rewards_current_episode)
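
The loop above expects `reset()` to return only a state and `step()` to return four values, which matches the classic Gym API. Newer Gym releases (0.26 and later) and Gymnasium return `(observation, info)` from `reset()` and a 5-tuple `(observation, reward, terminated, truncated, info)` from `step()`. A minimal sketch of two adapter functions follows; the names `reset_compat` and `step_compat` are hypothetical, and the sketch assumes that the `environment` wrapper simply passes through whatever the underlying environment returns.

```python
def reset_compat(env):
    """Return only the initial state for both old and new reset() signatures."""
    result = env.reset()
    # Newer API: (observation, info); older API: the observation alone.
    # Assumption: observations are not themselves 2-tuples (true for Taxi's integer states).
    if isinstance(result, tuple) and len(result) == 2:
        return result[0]
    return result


def step_compat(env, action):
    """Return (new_state, reward, done, info) for 4- and 5-tuple step() results."""
    result = env.step(action)
    if len(result) == 5:
        new_state, reward, terminated, truncated, info = result
        # The episode is over if it either terminated or hit a time limit.
        return new_state, reward, terminated or truncated, info
    return result
```

With these helpers the loop body would read `state = reset_compat(environment)` and `new_state, reward, done, info = step_compat(environment, action)`, independent of the installed version.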

In [81]:
# LEARNING STATISTICS
# print the average reward for each block of 1000 episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_of_all_episodes), num_episodes // 1000)
count = 1000
print('*****INFO: average reward per thousand episodes: ***** \n')
for rewards_chunk in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(rewards_chunk / 1000)))
    count += 1000
    
# print our learned q-table
print("\n\n ***** Q-TABLE ***** \n")
print(agent.q_table)

*****INFO: average reward per thousand episodes: ***** 

1000 :  -45.3959999999998
2000 :  7.407999999999966
3000 :  7.5049999999999635
4000 :  7.351999999999967
5000 :  7.470999999999966
6000 :  7.561999999999962
7000 :  7.5269999999999735
8000 :  7.457999999999962
9000 :  7.48799999999996
10000 :  7.464999999999965
11000 :  7.399999999999965
12000 :  7.312999999999967
13000 :  7.3119999999999585
14000 :  7.335999999999961
15000 :  7.421999999999961
16000 :  7.3349999999999556
17000 :  7.301999999999971
18000 :  7.198999999999964
19000 :  7.298999999999972
20000 :  7.408999999999964
21000 :  7.5029999999999575
22000 :  7.471999999999971
23000 :  7.283999999999966
24000 :  7.396999999999963
25000 :  7.408999999999962
26000 :  7.49899999999996
27000 :  7.361999999999972
28000 :  7.333999999999961
29000 :  7.434999999999964
30000 :  7.4679999999999644
31000 :  7.595999999999968
32000 :  7.188999999999968
33000 :  7.5039999999999605
34000 :  7.349999999999961
35000 :  7.42599999999996
[output truncated]
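
`np.split` only works when the number of episodes divides evenly into blocks of 1000. As a sketch of an alternative that tolerates any episode count, the hypothetical helper `chunk_means` below drops an incomplete trailing block and averages the rest. The demo rewards are synthetic and are not results from this notebook.

```python
import numpy as np


def chunk_means(rewards, chunk_size=1000):
    """Mean reward per consecutive block; an incomplete trailing block is dropped."""
    rewards = np.asarray(rewards, dtype=float)
    n_chunks = len(rewards) // chunk_size
    trimmed = rewards[: n_chunks * chunk_size]
    return trimmed.reshape(n_chunks, chunk_size).mean(axis=1)


# Synthetic example data: 1000 poor episodes followed by 1500 good ones.
demo_rewards = [-50.0] * 1000 + [7.0] * 1500
print(chunk_means(demo_rewards))
```

The returned array can be passed directly to a plotting function to draw the learning curve.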

In [85]:
# EVALUATION | TESTING | watching our agent play
for episode in range(num_test_episodes):
    state = environment.reset()
    done = False
    print("INFO:*****EPISODE ", episode+1, "\n\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        environment.render()
        time.sleep(0.3)
        
        action = agent.get_exploit_action(state)
        new_state, reward, done, info = environment.step(action)
        
        if done:
            clear_output(wait=True)
            environment.render()
            # +20 is the Taxi-v3 reward for a successful drop-off
            if reward == 20:
                print("INFO: ***** agent reached the goal. *****")
            else:
                print("INFO: ***** agent missed the goal. *****")
            time.sleep(3)
            clear_output(wait=True)
            break
        
        state = new_state

environment.close()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
+---------+
  (South)
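
Watching the agent is useful for a qualitative impression. For a number that can be compared across runs, the greedy policy can also be scored without rendering. The following is a minimal sketch: `evaluate_greedy` and the stand-in environment `ChainEnv` are hypothetical names written for this example, and the environment follows the 4-tuple `step()` convention used in this notebook.

```python
import numpy as np


def evaluate_greedy(env, q_table, num_episodes, max_steps, success_reward):
    """Run the greedy policy and return (success_rate, mean_return)."""
    successes, returns = 0, []
    for _ in range(num_episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action = int(np.argmax(q_table[state, :]))
            state, reward, done, _info = env.step(action)
            total += reward
            if done:
                # Count the episode as solved only if it ended with the goal reward.
                successes += int(reward == success_reward)
                break
        returns.append(total)
    return successes / num_episodes, float(np.mean(returns))


class ChainEnv:
    """Hypothetical 3-state stand-in: action 1 moves right, state 2 is the goal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        done = self.state == 2
        reward = 20 if done else -1
        return self.state, reward, done, {}


# Hand-written table that prefers action 1 ("right") in every non-goal state.
demo_q = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
print(evaluate_greedy(ChainEnv(), demo_q, num_episodes=5, max_steps=10, success_reward=20))
```

For the Taxi agent the same function could be called with `environment`, `agent.q_table` and `success_reward=20`, assuming the wrapper's `reset()` and `step()` behave as in the loops above.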


In [20]:
# LITTLE HOMEWORK
# 1. get an environment from the OpenAI Gym (e.g. cart pole, lunar lander, breakout)
# 2. print the essential information about the environment (state space, action space, ...)
# 3. write an agent class
# 4. train your agent on the environment using Q-Learning (play around with the hyperparameters for your environment)
# 5. plot your results (average reward, q-table)

***

# TODOs

1. Send your finished presentations (+ possibly annotated paper) by **Monday 12.00 AM/midnight** via email to henrik.voigt@uni-jena.de

2. Send your little HOMEWORK to henrik.voigt@uni-jena.de by **June 9th 12.00 AM/midnight**, using the naming convention HOMEWORK_02_FIRSTNAME_LASTNAME.ipynb

***
