# DRL Week 3, Python Exercise 

In this exercise we implement a simple environment and an agent that takes random actions. The goals of this exercise are:
* To get familiar with the Python language
* To implement explicitly some basic elements of an RL problem
* To get used to the basic vocabulary (most RL literature is in English)

# Specification of the task

### Define the Rewards and Transitions of the Environment

In [1]:
# Use a list [] to store the reward we get at each state
# https://www.w3schools.com/python/python_lists.asp
rewards = [0, -1, 0, -10, 4, 3, 7, 5]

In [2]:
# Access by (zero-based) index:
print('reward of state s4:')
r = rewards[4]
print(r)

# Store the stateID in a variable. Use .format() for string interpolation.
stateID = 5
r = rewards[stateID]
print('At state {} I obtain a reward r = {}'.format(stateID, r))

print('\nUse a for-loop to iterate over all indices:')
for i in range(len(rewards)):
    # In Python, indentation is important! All indented lines belong to the body of the for loop
    print('At state s{} you obtain a reward of {}'.format(i, rewards[i]))
    # this line is still inside the loop
# this line is outside the loop


reward of state s4:
4
At state 5 I obtain a reward r = 3

Use a for-loop to iterate over all indices:
At state s0 you obtain a reward of 0
At state s1 you obtain a reward of -1
At state s2 you obtain a reward of 0
At state s3 you obtain a reward of -10
At state s4 you obtain a reward of 4
At state s5 you obtain a reward of 3
At state s6 you obtain a reward of 7
At state s7 you obtain a reward of 5
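
The index-based loop above can also be written with `enumerate`, which yields the index and the value together. The following is a minimal sketch using the same `rewards` list. The names `state_id` and `lines` are chosen for illustration only, and the f-string syntax (an alternative to `.format()`) assumes Python 3.6 or newer.

```python
rewards = [0, -1, 0, -10, 4, 3, 7, 5]

# enumerate() yields (index, value) pairs, so no explicit indexing is needed
lines = []
for state_id, r in enumerate(rewards):
    lines.append(f'At state s{state_id} you obtain a reward of {r}')

print('\n'.join(lines))
```

Both loop styles print the same eight lines; `enumerate` simply avoids the separate `rewards[i]` lookup.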


In [3]:
# The transition matrix T is used to store the following information:
# "For each state s, and action a, what is the next state?"

# Not all states have a next_state. Some states are TERMINAL. Depending on context, such
# states are sometimes also called goal states.
# For each state we store whether it's terminal or not.

is_terminal = [False, False, False, True, False, True, True, True]


print(is_terminal[5])

# We have 8 states (s0 to s7) and 2 actions (a0 and a1). We can store the
# transitions in an 8-by-2 matrix.
# In Python we can use a list of lists. Later we will use NumPy, which provides matrices.
T = [ 
    [1, 2],  # in state s0, taking action a0 brings us to state s1, a1 to s2
    [3,4],   # in state s1, taking action a0 brings us to state s3, a1 to s4
    [5,6],   # in state s2, taking action a0 brings us to state s5, a1 to s6
    [-1, -1], # state s3 is a terminal state. next-state is undefined (we use -1 here).
    [7, 7],   # from state s4, both actions lead to state s7
    [-1, -1], # state s5 is a terminal state
    [-1, -1], # state s6 is a terminal state
    [-1, -1]  # state s7 is a terminal state
]
print(T)

True
[[1, 2], [3, 4], [5, 6], [-1, -1], [7, 7], [-1, -1], [-1, -1], [-1, -1]]
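
Since `is_terminal` and `T` are written by hand, a quick consistency check can catch typos. The sketch below assumes the convention used above: terminal states have only -1 entries in `T`, and all other entries are valid state indices. The helper name `check_environment` is hypothetical and not part of the exercise.

```python
is_terminal = [False, False, False, True, False, True, True, True]
T = [[1, 2], [3, 4], [5, 6], [-1, -1], [7, 7], [-1, -1], [-1, -1], [-1, -1]]

def check_environment(T, is_terminal):
    """Return True if T and is_terminal agree on which states are terminal."""
    if len(T) != len(is_terminal):
        return False
    n_states = len(T)
    for s in range(n_states):
        if is_terminal[s]:
            # terminal states must not define a next state
            if any(nxt != -1 for nxt in T[s]):
                return False
        else:
            # non-terminal states must point to existing states
            if any(not (0 <= nxt < n_states) for nxt in T[s]):
                return False
    return True

print(check_environment(T, is_terminal))
```

For the lists defined above, the check passes and prints `True`.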


In [4]:
# example of how to use the transition matrix T
current_state = 2
action = 0
next_state = T[current_state][action]
print(next_state)

5


In [5]:
# Iterate over all states and actions.
# Note two things:
# 1. range(8) iterates from 0 to 7; 8 is not included.
# 2. Nested lists are indexed as T[s][a].
#    Later, when using NumPy, this is done differently.
for s in range(8):
    for a in range(2):
        next_state = T[s][a] 
        print('T({},{}) -> {}'.format(s, a, next_state))

T(0,0) -> 1
T(0,1) -> 2
T(1,0) -> 3
T(1,1) -> 4
T(2,0) -> 5
T(2,1) -> 6
T(3,0) -> -1
T(3,1) -> -1
T(4,0) -> 7
T(4,1) -> 7
T(5,0) -> -1
T(5,1) -> -1
T(6,0) -> -1
T(6,1) -> -1
T(7,0) -> -1
T(7,1) -> -1


### Define the policy of the agent

We have now defined the environment. Next, we define an agent; that is, we define a function that selects actions. In this simple example, our agent chooses randomly between the two actions, each with 50% probability.
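
As a sketch, such a random policy can be written as a function that maps a state to an action. The name `random_policy` is an assumption made for illustration; the exercise code calls `random.randint` directly instead.

```python
import random

def random_policy(state):
    """Uniform random policy: ignores the state and returns action 0 or 1."""
    return random.randint(0, 1)  # randint includes both endpoints

# Sample the policy many times to check that both actions are about equally likely
actions = [random_policy(0) for _ in range(1000)]
print(sum(actions) / len(actions))  # fraction of a1, should be close to 0.5
```

Writing the policy as a function keeps the action selection separate from the environment, which makes it easy to swap in a different policy later.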

In [6]:
# Python is organized in modules. There are thousands of modules with a plethora of functions. 
# Before we can use a function from the random module, we have to import it:
# https://www.w3schools.com/python/module_random.asp
import random
# just a few test calls:
print(random.random())
print(random.random())
print(random.randint(-800,1000))
print(random.randint(-800,1000))

0.5279799407556023
0.7994987328162075
124
300


In [7]:
# the agent starts at state s0.
current_state = 0
# the sum of rewards R (also called the Return) is initialized with 0
R = 0

# In our environment, one can select from two actions, a0 and a1. Our agent follows a random policy where
# each action is taken with equal probability:
action = random.randint(0,1)

# observe the next state and the reward
next_state = T[current_state][action]
r = rewards[next_state] # use lower case r for the immediate reward. Upper case R is the sum of rewards, called Return.
R += r

print('I took action a{} and moved to state S{}. My reward for doing this is {}.'.format(action, next_state, r))

I took action a1 and moved to state S2. My reward for doing this is 0.


# Exercises

### Exercise 1
* Make sure you understand the code examples.
* Run the previous code block a few times and observe how the agent takes random actions.

### Exercise 2
An **episode** is a sequence of states and actions from a start state to a goal state (= terminal state). Our goal is to estimate the **Return R** an agent can expect to collect in one full episode.

* Copy the code example given above
* Wrap it inside a loop, such that the agent runs a full episode
* Run this loop a few times and observe the total reward **R** the agent collects in each episode. Do you get reasonable results?


In [8]:
# the agent starts at state s0.
current_state = 0
# the sum of rewards R (also called the Return) is initialized with 0
R = 0

while (not is_terminal[current_state]):
    # In our environment, one can select from two actions, a0 and a1. Our agent follows a random policy where
    # each action is taken with equal probability:
    action = random.randint(0,1)

    # observe the next state and the reward
    next_state = T[current_state][action]
    r = rewards[next_state] # use lower case r for the immediate reward. Upper case R is the sum of rewards, called Return.
    R += r

    print('I took action a{} and moved to state S{}. My reward for doing this is {}.'.format(action, next_state, r))
    current_state = next_state

print('My actions gave me a total reward R of {}'.format(R))

I took action a0 and moved to state S1. My reward for doing this is -1.
I took action a1 and moved to state S4. My reward for doing this is 4.
I took action a1 and moved to state S7. My reward for doing this is 5.
My actions gave me a total reward R of 8
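
The episode loop can also be packaged as a function, so that it can be reused with different policies. This is only a sketch: the names `run_episode`, `policy` and `visited` are hypothetical and not part of the original exercise.

```python
import random

rewards = [0, -1, 0, -10, 4, 3, 7, 5]
is_terminal = [False, False, False, True, False, True, True, True]
T = [[1, 2], [3, 4], [5, 6], [-1, -1], [7, 7], [-1, -1], [-1, -1], [-1, -1]]

def run_episode(policy, start_state=0):
    """Run one episode and return (Return R, list of visited states)."""
    state = start_state
    R = 0
    visited = [state]
    while not is_terminal[state]:
        action = policy(state)     # the policy maps a state to an action
        state = T[state][action]   # deterministic transition
        R += rewards[state]        # reward is collected on entering the new state
        visited.append(state)
    return R, visited

R, visited = run_episode(lambda s: random.randint(0, 1))
print('Visited states {} and collected a Return R of {}'.format(visited, R))
```

In this environment only four different episodes are possible, so `R` is always one of -11, 8, 3 or 7.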


### Exercise 3
* Wrap the code from Exercise 2 inside a loop which completes 10,000 episodes. Each episode starts at state s0 and ends at one of the terminal states.
* Calculate the mean of the total reward **R** over all episodes.


In [9]:
from statistics import mean
# list of saved rewards R
got_rewards = []

for x in range(10000):
    # the agent starts at state s0.
    current_state = 0
    # the sum of rewards R (also called the Return) is initialized with 0
    R = 0
    while (not is_terminal[current_state]):
        # In our environment, one can select from two actions, a0 and a1. Our agent follows a random policy where
        # each action is taken with equal probability:
        action = random.randint(0,1)

        # observe the next state and the reward
        next_state = T[current_state][action]
        r = rewards[next_state] # use lower case r for the immediate reward. Upper case R is the sum of rewards, called Return.
        R += r

        # print('I took action a{} and moved to state S{}. My reward for doing this is {}.'.format(action, next_state, r))
        current_state = next_state

    got_rewards.append(R)


print('Total rewards R measured: {}'.format(len(got_rewards)))
print('Mean of total reward: {}'.format(mean(got_rewards)))

Total rewards R measured: 10000
Mean of total reward: 1.883
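
Because this environment is small and its transitions are deterministic, the expected Return under the uniform random policy can also be computed exactly and compared with the sampled mean. The recursive helper `expected_return` and its parameter `p_a0` are hypothetical names used for this sketch; they are not part of the exercise.

```python
rewards = [0, -1, 0, -10, 4, 3, 7, 5]
is_terminal = [False, False, False, True, False, True, True, True]
T = [[1, 2], [3, 4], [5, 6], [-1, -1], [7, 7], [-1, -1], [-1, -1], [-1, -1]]

def expected_return(state, p_a0=0.5):
    """Expected sum of future rewards from `state` when a0 is chosen with probability p_a0."""
    if is_terminal[state]:
        return 0.0  # no further rewards after a terminal state
    probs = [p_a0, 1.0 - p_a0]
    total = 0.0
    for action, p in enumerate(probs):
        nxt = T[state][action]
        total += p * (rewards[nxt] + expected_return(nxt, p_a0))
    return total

print(expected_return(0))
```

The four possible episodes yield Returns of -11, 8, 3 and 7, each with probability 0.25, so the exact value is 1.75. The sampled mean of 1.883 is close to that, as one would expect from an estimate over 10000 episodes.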


### Exercise 4 
In the previous exercises, the agent took actions with equal probabilities: $\pi(s, left) = \pi(s, right) = 0.5$. Here 'left' is action a0 and 'right' is action a1.
* Change this policy to $\pi(s, left) = 0.3, \pi(s, right) = 0.7$. (Hint: use `random.random()`, described at https://www.w3schools.com/python/ref_random_random.asp, and go 'left' if the random number is < 0.3.)
* Rerun the previous experiment with this new policy and estimate the expected total reward.
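
An alternative to thresholding `random.random()` is `random.choices`, which draws from a list with given weights. This is a sketch of that alternative, not the approach suggested in the hint. It assumes that 'left' corresponds to action a0 and 'right' to a1, and the function name `biased_policy` is hypothetical.

```python
import random

def biased_policy(state):
    """Return action 0 ('left') with probability 0.3 and action 1 ('right') with 0.7."""
    # random.choices returns a list; with k=1 it holds a single element
    return random.choices([0, 1], weights=[0.3, 0.7], k=1)[0]

# Empirical check: the fraction of 'left' actions should be close to 0.3
n = 10000
n_left = sum(1 for _ in range(n) if biased_policy(0) == 0)
print(n_left / n)
```

The weighted draw generalizes more easily to more than two actions, while the threshold on `random.random()` makes the probabilities explicit for the two-action case.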


In [11]:
from statistics import mean
# list of saved rewards R
got_rewards = []

for x in range(10000):
    # the agent starts at state s0.
    current_state = 0
    # the sum of rewards R (also called the Return) is initialized with 0
    R = 0
    while (not is_terminal[current_state]):
        # In our environment, one can select from two actions, a0 and a1. Our agent follows a random policy where
        # each action is taken with its assigned probability: a0 with 0.3, a1 with 0.7.
        if random.random() < 0.3:
            action = 0
        else:
            action = 1

        # observe the next state and the reward
        next_state = T[current_state][action]
        r = rewards[next_state] # use lower case r for the immediate reward. Upper case R is the sum of rewards, called Return.
        R += r

        # print('I took action a{} and moved to state S{}. My reward for doing this is {}.'.format(action, next_state, r))
        current_state = next_state

    got_rewards.append(R)

print('The mean of a total of {} rewards is {}'.format(len(got_rewards), mean(got_rewards)))

The mean of a total of 10000 rewards is 4.6684
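
The sampled mean can be checked against the exact expected Return by listing the four possible episodes with their probabilities. This is a worked sketch based on the `rewards` and `T` defined earlier; the variable names `p_left`, `p_right` and `episodes` are chosen for illustration.

```python
p_left, p_right = 0.3, 0.7

# (probability, Return) for each of the four possible episodes
episodes = [
    (p_left * p_left,   -1 + -10),     # s0 -> s1 -> s3
    (p_left * p_right,  -1 + 4 + 5),   # s0 -> s1 -> s4 -> s7 (both actions lead from s4 to s7)
    (p_right * p_left,   0 + 3),       # s0 -> s2 -> s5
    (p_right * p_right,  0 + 7),       # s0 -> s2 -> s6
]

expected = sum(p * R for p, R in episodes)
print(round(expected, 4))
```

The exact value is 0.09 * (-11) + 0.21 * 8 + 0.21 * 3 + 0.49 * 7 = 4.75, which is close to the sampled mean of 4.6684. Favouring action a1 raises the expected Return because it makes the costly path through s3 less likely.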
