Open Files/A2_... and study it in Colab by running it.

Observe how all the facets of a coupled machine/environment reinforcement learning system are present.

The notebook includes some code to show how the behaviour of the agent can be rendered, using a random policy that exploits the .sample() method.

Exercise 1:

Can you design a dynamic programming based policy for the agent as in assignment 1? If so, design it and demonstrate that it solves the cart pole problem.

Exercise 2:

Can you design a Monte Carlo based policy for the agent? What ingredients do you require? Explain the design flow, and execute it. Show that it works, or indicate why you can't proceed.

 
Submission:

Submit a pdf containing your answers, including any code you've written.

 

This exercise should take one or two days; however, you'll be given a week.


### Exercise 1:

Can you design a dynamic programming based policy for the agent as in assignment 1? If so, design it and demonstrate that it solves the cart pole problem.

The state of the cart-pole problem is continuous, so tabular methods such as Dynamic Programming do not apply directly. The way around this is to discretize the continuous space by binning each state variable, which creates a "grid" of states similar to the gridworld problem.
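As a minimal sketch of the binning idea, the snippet below maps a continuous four-component state to a tuple of bin indices using only the standard library. The helper names (`make_edges`, `discretize`, `BOUNDS`, `EDGES`) are hypothetical, and the velocity limits of ±3.0 are an assumed clipping range chosen for illustration; only the position and angle limits correspond to CartPole's termination thresholds.

```python
from bisect import bisect_right

def make_edges(low, high, n_bins):
    """Return the n_bins - 1 interior edges that split [low, high] into equal bins."""
    width = (high - low) / n_bins
    return [low + width * k for k in range(1, n_bins)]

def discretize(state, edges_per_dim):
    """Map each continuous component to a bin index in 0 .. n_bins - 1.

    Values outside the range fall into the first or last bin.
    """
    return tuple(bisect_right(edges, x) for x, edges in zip(state, edges_per_dim))

# Assumed bounds: (cart position, cart velocity, pole angle, pole angular velocity).
# The velocity limits are an arbitrary illustrative choice.
BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
EDGES = [make_edges(low, high, 10) for low, high in BOUNDS]

print(discretize((0.1, 0.1, 0.01, 0.1), EDGES))    # (5, 5, 5, 5)
print(discretize((-5.0, 9.9, -1.0, 1.0), EDGES))   # (0, 9, 0, 6)
```

Because only interior edges are stored, every index is a valid table position and no separate clipping step is needed.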

However, as the interrupted run below shows, the process is slow: even after discretization the number of states is large, and every sweep of value iteration has to step the simulator once for each state-action pair.
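To put a rough number on that cost, the sketch below counts the discrete states and the simulator steps needed per sweep for a given grid. The function name `sweep_cost` is hypothetical; the figures follow directly from 10 bins per variable and 2 actions.

```python
from math import prod

def sweep_cost(bins_per_dim, n_actions):
    """Return (number of discrete states, simulator steps per sweep)."""
    n_states = prod(bins_per_dim)
    return n_states, n_states * n_actions

n_states, steps_per_sweep = sweep_cost([10, 10, 10, 10], 2)
print(n_states, steps_per_sweep)   # 10000 20000
print(steps_per_sweep * 1000)      # 20000000 steps for 1000 sweeps

# Doubling the resolution multiplies the cost by 2**4 = 16.
print(sweep_cost([20, 20, 20, 20], 2))   # (160000, 320000)
```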

In [None]:
import numpy as np
import gym

# Define the environment
env = gym.make('CartPole-v0')

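# Discretization: We will represent states as discrete values rather than a continuous range.
# We'll use 10 bins for each of the four state variables (cart position, cart velocity, pole angle, pole angular velocity).
NUM_BINS = [10, 10, 10, 10]
# Per-variable ranges. Position and angle follow the environment's termination limits
# (|x| > 2.4, |angle| > ~0.21 rad). The velocities are unbounded in CartPole,
# so the +/-3.0 limits are an assumed clipping range.
STATE_BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
bins = [np.linspace(low, high, num) for (low, high), num in zip(STATE_BOUNDS, NUM_BINS)]

# Initialize the action-value table to zeros (one entry per discrete state and action)
values = np.zeros(NUM_BINS + [env.action_space.n])

# Discount factor for future rewards
gamma = 0.95

# Policy: At each state, the agent will choose the action with the highest expected future reward.
policy = np.zeros(NUM_BINS, dtype=int)

def discretize(state):
    """Convert continuous state into discrete bin indices, clipped to the table bounds."""
    return tuple(
        int(np.clip(np.digitize(s, bins[i]) - 1, 0, NUM_BINS[i] - 1))
        for i, s in enumerate(state)
    )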



In [4]:
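for iteration in range(1000):  # Value-iteration sweeps
    print(iteration)  # Progress indicator; each sweep is slow
    new_values = np.copy(values)  # Copy current value table
    for state in np.ndindex(*NUM_BINS):  # For each state in the state space
        # Representative continuous state for this grid cell (its grid point in `bins`)
        continuous_state = np.array([bins[i][idx] for i, idx in enumerate(state)])
        for action in range(env.action_space.n):  # For each action
            env.reset()
            # Set the simulator to the continuous state, not to the bin indices
            env.unwrapped.state = continuous_state
            next_state, reward, terminated, truncated, info = env.step(action)
            next_state = discretize(next_state)
            # Bellman backup; a terminal transition has no future value
            future = 0.0 if terminated else np.max(values[next_state])
            new_values[state][action] = reward + gamma * future
    converged = np.max(np.abs(new_values - values)) < 1e-4
    values = new_values  # Update value table
    if converged:  # Stop early once the table no longer changes
        break

# Update policy: act greedily with respect to the learned values
for state in np.ndindex(*NUM_BINS):
    policy[state] = np.argmax(values[state])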


0
1
2
...
68


KeyboardInterrupt: 
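The Bellman backup used in the loop above can be checked on a problem small enough to converge in a few sweeps. The following is a minimal sketch on a hypothetical three-state chain (it is not the cart-pole model): moving right from state 1 reaches the terminal state 2 and earns a reward of 1, and all other moves earn 0. The names `step`, `value_iteration`, `GAMMA` and the chain itself are assumptions made for illustration.

```python
GAMMA = 0.5
N_STATES, N_ACTIONS = 3, 2   # actions: 0 = left, 1 = right
TERMINAL = 2

def step(state, action):
    """Hypothetical deterministic model: returns (next_state, reward, terminated)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    terminated = next_state == TERMINAL
    return next_state, (1.0 if terminated else 0.0), terminated

def value_iteration(tol=1e-9):
    """Sweep Q(s, a) = r + gamma * max_a' Q(s', a') until the table stops changing."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    while True:
        new_q = [row[:] for row in q]
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # No actions are taken from the terminal state
            for a in range(N_ACTIONS):
                s2, r, terminated = step(s, a)
                # A terminal transition contributes no future value
                future = 0.0 if terminated else max(q[s2])
                new_q[s][a] = r + GAMMA * future
        delta = max(abs(new_q[s][a] - q[s][a])
                    for s in range(N_STATES) for a in range(N_ACTIONS))
        q = new_q
        if delta < tol:
            return q

q = value_iteration()
# Greedy policy for the non-terminal states
policy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(TERMINAL)]
print(q[0], q[1], policy)   # [0.25, 0.5] [0.25, 1.0] [1, 1]
```

The greedy policy moves right in both non-terminal states, and the values shrink by a factor of `GAMMA` for each extra step away from the reward.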

In [None]:
import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Hyperparameters
gamma = 0.95
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
learning_rate = 0.01
episodes = 1000

# Initialize environment
env = gym.make('CartPole-v1')


# Build model
model = Sequential()
model.add(Dense(24, input_dim=4, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))

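# Train model with online Q-learning updates (one fit per environment step)
for episode in range(episodes):
    state, _ = env.reset()  # gym >= 0.26 returns (observation, info)
    state = state.reshape(1, 4)
    done = False
    time = 0

    while not done:
        time += 1
        # Epsilon-greedy action selection
        if np.random.rand() <= epsilon:
            action = np.random.randint(2)
        else:
            action = np.argmax(model.predict(state, verbose=0)[0])

        # gym >= 0.26 returns separate terminated and truncated flags
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        reward = reward if not terminated else -10  # Penalize dropping the pole
        next_state = next_state.reshape(1, 4)
        # TD target: do not bootstrap from a terminal state
        target = reward
        if not terminated:
            target += gamma * np.amax(model.predict(next_state, verbose=0)[0])
        target_f = model.predict(state, verbose=0)
        target_f[0][action] = target
        model.fit(state, target_f, epochs=1, verbose=0)

        state = next_state

        if done:
            print("Episode: {}/{}, Score: {}".format(episode, episodes, time))
            break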

    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

env.close()


In [None]:
import tensorflow as tf
import keras

print(tf.__version__)
print(keras.__version__)
