In this problem, you will fill in several blank parts of the code to make DeepQLearning work. You will then answer a few brief questions about the learning progress of a Q-learning algorithm.

Note that we are applying deep Q-learning to CartPole, which has a continuous state space! This is not possible with the tabular version of Q-learning from the last problem, because you cannot trivially map continuous values into a lookup table.

In [25]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import random
import gym

In [26]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NStates = 4
NActions = 2

In [15]:
# Define our neural net we will use to estimate the Q-function
class QFunctionNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(QFunctionNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

Next we need to add a replay buffer. This buffer should hold past data collected by the agent. It should have a max memory size of 10,000. The buffer should be circular, in the sense that the oldest data gets removed once it reaches the overflow limit.

You need to implement add_to_buffer, which allows you to add data to the buffer. You also need to implement sample_buffer, which pulls at random (with replacement) 32 datapoints from the buffer and returns them.
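
As a minimal sketch of the circular behaviour described above (not the required implementation), the write position can be taken as the running count modulo the capacity. The names `capacity`, `slots`, and `count` are hypothetical and exist only for this illustration; a capacity of 3 is used so the wrap-around is easy to see.

```python
# Minimal illustration of a circular write index, using a plain Python list.
# `capacity`, `slots`, and `count` are hypothetical names for this sketch only.
capacity = 3
slots = [None] * capacity
count = 0  # total number of items ever written

for item in ["a", "b", "c", "d", "e"]:
    # Once count reaches capacity, the index wraps to 0 and overwrites the oldest entry.
    slots[count % capacity] = item
    count += 1

# Only the first min(count, capacity) slots hold real data, so sampling
# should be restricted to that range while the buffer is still filling up.
valid = min(count, capacity)
print(slots, valid)
```

The same idea carries over to several parallel arrays (states, actions, rewards, and so on) that all share one write index.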

In [16]:
class ReplayBuffer:
    def __init__(self, mem_size=10000):
        self.mem_count = 0
        self.mem_size = mem_size
        self.states = np.zeros((mem_size, 4), dtype=np.float32)
        self.actions = np.zeros(mem_size, dtype=np.int64)
        self.rewards = np.zeros(mem_size, dtype=np.float32)
        self.next_states = np.zeros((mem_size, 4), dtype=np.float32)
        self.terminated = np.zeros(mem_size, dtype=bool)

    def add_to_buffer(self, state, action, reward, next_state, terminated):
        # add state, action, reward, next_state, terminated to the buffer. Note you only have one copy of each. For example, state is just a (4,) dimensional array.
        self.states[self.mem_count % self.mem_size, range(4)] = state
        self.actions[self.mem_count % self.mem_size] = action
        self.rewards[self.mem_count % self.mem_size] = reward
        self.next_states[self.mem_count % self.mem_size, range(4)] = next_state
        self.terminated[self.mem_count % self.mem_size] = terminated
        self.mem_count += 1
        return True

    def sample_buffer(self, batch_size=32):
        MEM_MAX = np.min([self.mem_count, self.mem_size])
        batch_indices = np.random.choice(MEM_MAX, batch_size, replace = True)
        states_mb = self.states[batch_indices,]
        actions_mb = self.actions[batch_indices]
        rewards_mb = self.rewards[batch_indices]
        next_states_mb = self.next_states[batch_indices,]
        terminated_mb = self.terminated[batch_indices]
        return states_mb, actions_mb, rewards_mb, next_states_mb, terminated_mb

RL agents are typically written as classes with three essential methods. First, an initialization. Second, train, which takes a batch of states, actions, rewards, next states, and termination signals and updates the neural network to reduce the loss. Finally, a get_action function, which takes the current observation and returns the action the agent should take.
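
For reference, the quantity that train works with can be written as a TD target and a mean-squared loss. This is a sketch in notation introduced here rather than taken from the assignment: $Q_\theta$ stands for the network, $d_i$ for the termination flag of transition $i$, and $B$ for the minibatch size.

```latex
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_\theta(s'_i, a'),
\qquad
\mathcal{L}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \bigl( y_i - Q_\theta(s_i, a_i) \bigr)^2
```

The factor $(1 - d_i)$ removes the bootstrap term on terminal transitions, where there is no future reward left to estimate.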

In [21]:
class QLearningAgent:
    def __init__(self, env, epsilon_greedy_threshold=0.50, epsilon_decay=0.995):
        self.env = env
        #self.replayBuffer = ReplayBuffer()
        self.q_func = QFunctionNet(input_size=NStates, num_classes=NActions, hidden_size=128).to(device)

        self.criterion = nn.MSELoss()
#         self.criterion = nn.SmoothL1Loss()
#         self.optimizer = torch.optim.RMSprop(self.q_func.parameters())
        self.optimizer = torch.optim.Adam(self.q_func.parameters(), lr=0.0001)

        self.epsilon_greedy_threshold = epsilon_greedy_threshold
        self.epsilon_decay = epsilon_decay

        self.gamma = 0.95

        self.total_training_steps = 0

    def train(self, states, acts, rews, next_states, terminated):

        # Torch and numpy don't like each other. So you have to spend a lot of time doing these kinds of conversions from numpy data that is stored in the buffer into Torch data. This converts each batch into a tensor and moves it to the same device as the network.
        states_torch = torch.from_numpy(states).float().to(device)
        acts_torch = torch.from_numpy(acts).to(device)
        next_states_torch = torch.from_numpy(next_states).float().to(device)
        rews_torch = torch.from_numpy(rews).float().to(device)
        terminated_torch = torch.from_numpy(terminated).float().to(device)

        # you need this to grab the q-values at the current state.
        batch_indices = np.arange(len(states), dtype=np.int64)


        q_values = self.q_func(states_torch)# use your neural net
        next_q_values = self.q_func(next_states_torch)# use your neural net again

        # This is the Q-values at the actions that were actually taken. So you need to fill in the second index to account for the actions that were taken.
        predicted_q_values_at_actions_taken = predicted_value_of_now = q_values[batch_indices, acts_torch]

        # Remember how Q-learning assumes you take the action leading to the maximal Q-value at the next step? It's the max part of the TD target. This is the code version of that situation. You want to look at next_q_values and take the max along the action dimension, which will give you the estimated Q-value at the best possible action.

        # Use the torch.max function. 
        predicted_q_values_at_best_future_actions = torch.max(next_q_values, dim=1)[0]####

        # Fill in the missing term in the q-target.
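        # Detach the target so gradients flow only through the prediction (the usual semi-gradient Q-learning update).
        q_target = (rews_torch + self.gamma * predicted_q_values_at_best_future_actions * (1-terminated_torch)).detach()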


        loss = self.criterion(q_target, predicted_q_values_at_actions_taken)

        self.optimizer.zero_grad()
        loss.backward()

        self.optimizer.step()

        if self.total_training_steps % 1 == 0:
            # There are a few ways to handle decreasing the epsilon greedy threshold.
            # They tend to involve either multiplication or subtraction.
            # You want the threshold to decay towards 0 over the course of training.
            self.epsilon_greedy_threshold *= self.epsilon_decay

        self.total_training_steps += 1

    def get_action(self, obs, evaluate=False):
        # add a batch dimension, cast to float32, and move to the network's device
        obs_correct_format = torch.from_numpy(np.expand_dims(obs, 0)).float().to(device)
        action_values = self.q_func(obs_correct_format)
        action_values = action_values.cpu().detach().numpy()[0]

        print(f"action values: {action_values}")

        # during evaluation, we do not want epsilon greedy sampling.
        if evaluate is False:
            p = random.uniform(0, 1)
            if p > self.epsilon_greedy_threshold:
                action = np.argmax(action_values) 
                print(f"selected action: {action}")
            else:
                action = self.env.action_space.sample()
            return action
        else:
            # we are evaluating, so we always want to return the best action with no exploration.
            action = np.argmax(action_values)
            return action
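
To get a feel for how quickly exploration shrinks under multiplicative decay, the sketch below computes the threshold after a given number of train() calls. It assumes the defaults used above (a starting threshold of 0.50 and a decay of 0.995 applied once per call) and the 50 updates per episode used in the main loop; `epsilon_after` and `updates_until` are hypothetical helper names, not part of the assignment code.

```python
import math

# Values assumed from the agent defaults and the training loop in this notebook.
epsilon_start = 0.50
epsilon_decay = 0.995
updates_per_episode = 50

def epsilon_after(num_updates):
    """Exploration threshold after `num_updates` multiplicative decay steps."""
    return epsilon_start * epsilon_decay ** num_updates

def updates_until(target):
    """Smallest number of decay steps after which the threshold is at or below `target`."""
    return math.ceil(math.log(target / epsilon_start) / math.log(epsilon_decay))

print(epsilon_after(updates_per_episode))  # threshold after one episode's worth of updates
print(updates_until(0.01))                 # decay steps needed to reach 1% exploration
```

Under these assumptions the threshold drops to roughly 0.39 after the first episode that trains and falls below 0.01 after about 780 updates, which is under 20 episodes. That is worth keeping in mind when experimenting with the decay rate.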

Fancy plotting utility shown in class. Can produce a plot in real time, making it look kind of like a video. It's fun to use sometimes, although not really practical in real life.

In [22]:
def plot_fancy(steps_survived_history):
    plt.figure(2)
    plt.clf()
    #durations_t = torch.FloatTensor(episode_durations)
    steps_survived_history_torch = torch.FloatTensor(steps_survived_history)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(np.array(steps_survived_history))
    # take 20-point moving averages and plot them too
    if len(steps_survived_history) >= 20:
        means = steps_survived_history_torch.unfold(0, 20, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(19), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated

In [37]:
env = gym.make('CartPole-v1')
env.action_space.sample()

1

In [51]:
action_space = [0,1]
random.sample(action_space, k=1)[0]

0

Now all we need is the main loop.

In [23]:
def main():

    env = gym.make('CartPole-v1')
    env.action_space.seed(42)
    env.observation_space.seed(42)
    np.random.seed(42)
    random.seed(42)

    obs, info = env.reset()

    replay_buffer = ReplayBuffer()
    q_learning_agent = QLearningAgent(env = env)

    # total training episodes
    epochs = 2000

    # Store the history of how long we survive.
    steps_survived_history = []
    x_labs = []

    for ep in range(epochs):
        obs, info = env.reset()
        terminated = False

        steps_survived = 0

        while not terminated:

            action = q_learning_agent.get_action(obs=obs)

            next_obs, reward, terminated, truncated, info = env.step(action)

            replay_buffer.add_to_buffer(state=obs, action=action,
                                        reward=reward, next_state=next_obs,
                                        terminated=terminated)

            obs = next_obs

            steps_survived += 1

            if steps_survived > 450:
                break

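        # len(replay_buffer.states) is always mem_size, so check the number of stored transitions instead.
        if replay_buffer.mem_count > 100:
            # train 50 times per episode collected.
            for i in range(0, 50):
                # use terminated_mb so the episode-level `terminated` flag is not overwritten
                states_mb, acts_mb, rews_mb, next_states_mb, terminated_mb = replay_buffer.sample_buffer(batch_size=32)
                q_learning_agent.train(states_mb, acts_mb, rews_mb, next_states_mb, terminated_mb)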

        if ep % 7 == 0 and ep > 0:

            steps_survived = 0
            obs, info = env.reset()
            terminated = False
            
            # We do an evaluation loop without exploration.
            while not terminated:
                steps_survived += 1
                action = q_learning_agent.get_action(obs=obs, evaluate=True)

                next_obs, reward, terminated, truncated, info = env.step(action)

                obs = next_obs

                if steps_survived > 450:
                    break

            steps_survived_history.append(steps_survived)
#             print(steps_survived)

            #uncomment the line below for fancy plotting.
            # plot_fancy(steps_survived_history)
            x_labs.append(ep)

    plt.plot(x_labs, steps_survived_history, label='Steps Survived')
    plt.legend()
    plt.show()
    # average over the last 100 evaluations (the slice [-100:-1] would drop the most recent one)
    print(np.mean(steps_survived_history[-100:]))

In [24]:
if __name__ == "__main__":
    main()

action values: [ 0.04499604 -0.04223739]
selected action: 0
action values: [0.07835595 0.00536241]
action values: [0.11768278 0.04848418]
action values: [0.07801969 0.00709786]
action values: [ 0.04366037 -0.03662141]
selected action: 0
action values: [0.08174714 0.01110867]
selected action: 0
action values: [0.12365519 0.05478377]
selected action: 0
action values: [0.16957885 0.10199633]
action values: [0.2116831  0.14385799]
action values: [0.2484636  0.18798916]
action values: [0.2190494  0.14940597]
action values: [0.26347783 0.19477305]
selected action: 0
action values: [0.07049581 0.02529136]
action values: [0.0748513  0.00111941]
action values: [0.07159441 0.02384013]
selected action: 0
action values: [0.08795772 0.06548508]
selected action: 0
action values: [0.11360352 0.10378602]
action values: [0.1464149  0.14682981]
selected action: 1
action values: [0.10454386 0.09841219]
selected action: 0
action values: [0.13982198 0.14286149]
action values: [0.09896974 0.0944542 ]
select

KeyboardInterrupt: 

# Please answer the following questions

# #1 Fill in the missing code.
After finishing the code, plot the learning curve, with the x-axis being the training episode and the y-axis being the number of steps cartpole survives. Note that this plot is generated automatically by the plt.plot, plt.legend, and plt.show calls at the end of main() above, so you don't have to write any additional code to do this.

As we saw in class, the return can be quite unstable. Please include this plot in your final write-up.


# #2 Can you do better?

Spend 10-30 minutes playing around with the algorithm. You might want to change the discount factor gamma. You can also change the number of times you train per episode collected, which is set to 50 in the code above. Another interesting value is the rate at which the epsilon threshold in epsilon greedy exploration decays. You could also try changing the number of hidden units in the neural network or the optimizer learning rate.

Please include your best learning curve in the final write-up. More importantly, what parameters were most important to training? Are there any parameters that negatively impact the results if they are changed?