# Learning to use Gym, PyTorch and StableBaselines3 for Reinforcement Learning
A very simple notebook for learning how to use the three tools above for reinforcement learning. I might also throw in Weights & Biases if I'm feeling lucky, and then eventually move on to MuJoCo for better physics simulation. There is a lot to do here, so I need to be ready for a fair amount of debugging.
## To-Do List
- [X] Fix the error where no `torch.tensor` is being passed
- [ ] Run the simulation and try to get a decent model
- [ ] Evaluate the model on a better set of variables
- [ ] Transfer to the GPU (somehow? Don't know if CUDA is installed)

In [1]:
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt  # pyplot only; aliasing the top-level package as plt was redundant
from tqdm import tqdm
import gym

MAX_EPISODES = 20
MAX_ITERATIONS = 100

gym.__version__


'0.21.0'

In [2]:
# Start with just the simple cartpole problem

env = gym.make('CartPole-v1')
env.reset()

for idx in range(MAX_EPISODES):
    observation = env.reset()
    for t in range(MAX_ITERATIONS):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        
        if done:
            print(observation)
            print(f"Episode finished after {t+1} timesteps")
            break
env.close()

[-0.16037339 -1.0342499   0.21902104  1.6711514 ]
Episode finished after 45 timesteps
[ 0.08830717  0.7962994  -0.2097781  -1.46419   ]
Episode finished after 26 timesteps
[ 0.21185736  1.5983223  -0.22678718 -2.3919704 ]
Episode finished after 22 timesteps
[-0.01129861 -0.0545598   0.22464773  1.1823016 ]
Episode finished after 30 timesteps
[ 0.13709576  0.37880445 -0.23063438 -0.9599621 ]
Episode finished after 20 timesteps
[-0.08154249  0.00979014  0.22887889  0.7721656 ]
Episode finished after 32 timesteps
[-0.17901531 -1.3363675   0.2388248   2.3129249 ]
Episode finished after 13 timesteps
[-0.14487733 -0.6426741   0.21945454  1.2984225 ]
Episode finished after 17 timesteps
[ 0.08514409  0.18760517 -0.21312739 -0.70261735]
Episode finished after 15 timesteps
[ 0.01621801  0.76733685 -0.21554087 -1.5854744 ]
Episode finished after 32 timesteps
[ 0.04137275 -0.551702   -0.21166815 -0.12708661]
Episode finished after 27 timesteps
[ 0.09635995  1.3669235  -0.2264291  -2.332564  ]
Epis

In [27]:
"""
Information regarding environment:

Observation space:
(4,) array with elements: [cart position, cart velocity, pole angle, pole angular velocity]

Action space:
Discrete(2): a single integer in {0, 1} (0 pushes the cart left, 1 pushes it right)
"""

import math 
import random
from collections import namedtuple, deque
from itertools import count

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


device = torch.device("cpu")
print(torch.__version__)

1.11.0


The idea is to port the deep Q-learning example from the "Hands-On ML" book to PyTorch, since I'd rather use PyTorch for my own work. Deep Q-learning approximates a function $Q^*:\textit{State}\times\textit{Action}\rightarrow \mathbb{R}$ that gives the expected return for taking a specific action in a given state. The optimal policy picks the action that maximises it: $\pi^*(s)=\underset{a}{\operatorname{argmax}}\,Q^*(s, a)$
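
For reference, a sketch of the standard relation that makes $Q^*$ learnable. The symbols $r$ (reward), $s'$ (next state), $a'$ (next action) and $\gamma$ (discount factor) are introduced here for illustration and do not appear in the text above:

$$Q^*(s,a)=\mathbb{E}\left[\,r+\gamma\max_{a'}Q^*(s',a')\,\right]$$

In practice a network $Q_\theta$ is trained so that $Q_\theta(s,a)$ moves towards the sampled target $r+\gamma\max_{a'}Q_{\theta^-}(s',a')$, where $\theta^-$ denotes the weights of a periodically copied target network. When $s'$ is terminal, the target reduces to $r$.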

In [28]:
input_shape = 4 # Size of a CartPole observation: [pos, vel, ang, ang_vel]
output_shape = 2 # Number of discrete actions: {0, 1}

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):
    """
    Replay buffer that stores previous transitions for the training algorithm, backed by a `collections.deque`.
    (Could change to using the Reverb library from DeepMind)
    """

    def __init__(self, capacity):
        self.memory = deque([],maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
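
The training code later unpacks a sampled batch with `Transition(*zip(*transitions))`. A minimal sketch of what that idiom does, using made-up tuples and integers in place of tensors (the values are purely illustrative):

```python
from collections import namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

# Three made-up transitions; plain tuples and ints stand in for tensors
transitions = [
    Transition((0.0, 0.1), 0, (0.1, 0.2), 1.0),
    Transition((0.1, 0.2), 1, (0.2, 0.3), 1.0),
    Transition((0.2, 0.3), 0, None, 1.0),  # terminal step: no next state
]

# zip(*transitions) groups the i-th field of every transition together,
# so a list of Transitions becomes one Transition of tuples
batch = Transition(*zip(*transitions))

print(batch.action)      # (0, 1, 0)
print(batch.next_state)  # ((0.1, 0.2), (0.2, 0.3), None)

# Mask of non-terminal transitions, as used when computing next-state values
non_final = tuple(s is not None for s in batch.next_state)
print(non_final)         # (True, True, False)
```

Each field of `batch` is then a tuple with one entry per sampled transition, which is the form that can be turned into a batched tensor.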

In [29]:
# Build a DQN model in PyTorch by subclassing nn.Module, which registers the layers' parameters

class DQN(nn.Module):
    """
    Very basic network that has all linear layers with an input of 4 (the observations) and an output of 2 (the actions)
    """
    
    def __init__(self, inputs, output):
        super(DQN, self).__init__()
        self.l_input = nn.Linear(inputs, 32)
        self.hidden_1 = nn.Linear(32, 32)
        self.l_output = nn.Linear(32, output)
        
        
    def forward(self, x):
        x = self.l_input(x)
        x = F.relu(x)
        x = self.hidden_1(x)
        x = F.relu(x)
        return self.l_output(x)
    
net = DQN(input_shape, output_shape)
net

DQN(
  (l_input): Linear(in_features=4, out_features=32, bias=True)
  (hidden_1): Linear(in_features=32, out_features=32, bias=True)
  (l_output): Linear(in_features=32, out_features=2, bias=True)
)

In [30]:
BATCH_SIZE = 128
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10

policy_net = DQN(input_shape, output_shape).to(device)
target_net = DQN(input_shape, output_shape).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

DQN(
  (l_input): Linear(in_features=4, out_features=32, bias=True)
  (hidden_1): Linear(in_features=32, out_features=32, bias=True)
  (l_output): Linear(in_features=32, out_features=2, bias=True)
)

For the next part, we need to set up the optimiser for our model, along with the function that selects the next action. These functions are adapted from the corresponding section of the "Hands-On ML" book, but written in PyTorch for better future-proofing.
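
Action selection here is epsilon-greedy with an exponentially decaying exploration rate. A small sketch of how that rate evolves, using the `EPS_START`, `EPS_END` and `EPS_DECAY` values defined earlier (the helper name `epsilon` is hypothetical and only used for this illustration):

```python
import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 200  # same values as defined earlier

def epsilon(steps_done):
    """Probability of taking a random action after `steps_done` selections."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)

for steps in (0, 200, 1000):
    print(steps, round(epsilon(steps), 3))
# 0 0.9
# 200 0.363
# 1000 0.056
```

Because the step counter is global rather than per-episode, a handful of short episodes leaves the agent acting mostly at random.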

In [31]:
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) *  math.exp(-1*steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            return policy_net(state).max(0)[1].view(1,1)
    else:
        return torch.tensor([[random.randrange(output_shape)]], device=device, dtype=torch.long)
    
def optimise_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    batch = Transition(*zip(*transitions))
    # Mask of transitions whose next state is not terminal
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = [s for s in batch.next_state if s is not None]
    
    # Each stored state has shape (4,), so stack (not cat) to get a (BATCH_SIZE, 4) batch
    state_batch = torch.stack(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)
    
    # Compute Q(s_t, a) for the model. 
    state_action_values = policy_net(state_batch).gather(1, action_batch)
    
    # Compute V(s_{t+1}) for all the next states (terminal states keep a value of 0)
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    if non_final_next_states:
        next_state_values[non_final_mask] = target_net(torch.stack(non_final_next_states)).max(1)[0].detach()
    
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
    
    # Compute the Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
    
    optimizer.zero_grad()
    loss.backward()
    
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1,1)
        
    # Step the optimiser
    optimizer.step()
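
The loss used above, `nn.SmoothL1Loss`, is quadratic for small errors and linear for large ones, which keeps a few bad Q-value estimates from dominating the update. A hand-written sketch of the per-element formula, assuming the default `beta=1.0` (the function name `smooth_l1` is hypothetical; this is not PyTorch's implementation):

```python
def smooth_l1(error, beta=1.0):
    """Per-element Smooth L1 loss for a single prediction error."""
    abs_err = abs(error)
    if abs_err < beta:
        return 0.5 * abs_err ** 2 / beta   # quadratic region
    return abs_err - 0.5 * beta            # linear region

# Compare against a plain squared-error loss of 0.5 * err^2
for err in (0.5, 1.0, 3.0):
    print(err, smooth_l1(err), 0.5 * err ** 2)
# 0.5 0.125 0.125
# 1.0 0.5 0.5
# 3.0 2.5 4.5
```

The two losses agree up to an error of 1, after which the Smooth L1 loss grows more slowly.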

The final step is the actual training loop. This is the part that causes problems, so we need to be careful how we do it. The observation comes back from Gym as a NumPy `ndarray`, whereas the model expects a `torch.Tensor`, so the conversion has to happen before the model is called.
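
A minimal sketch of that conversion, using a made-up observation rather than a real environment. The `float64` dtype is chosen deliberately to show the cast; the CartPole observations printed earlier look like `float32` already, in which case `.float()` changes nothing:

```python
import numpy as np
import torch

# Stand-in for a CartPole observation (values are made up)
obs = np.array([0.01, -0.02, 0.03, 0.04], dtype=np.float64)

state = torch.from_numpy(obs)    # shares memory with obs and keeps its dtype
print(state.dtype, state.shape)  # torch.float64 torch.Size([4])

# nn.Linear weights are float32 by default, so cast before calling the model
state = state.float()
print(state.dtype)               # torch.float32

# A leading batch dimension gives the (batch, features) layout
batched = state.unsqueeze(0)
print(batched.shape)             # torch.Size([1, 4])
```

Whether states are stored with or without the batch dimension matters later, when the replay buffer is turned into a single batch tensor.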

In [46]:
NUM_EPISODES = 5 # Larger values fill the replay buffer past BATCH_SIZE, so optimise_model() stops returning early (that is where the errors showed up)
NUM_ITERATIONS = 200

episode_durations = []
steps_done = 0

optimizer = optim.RMSprop(policy_net.parameters())
memory = ReplayMemory(10000)

"""
TODO:
Need to find out why it is not a torch tensor. Do I need to convert everything to tensors to stop the error?
UPDATE 25/05/2022@17:00 - It seems I do need to change everything to tensors to get the error to go away, but it seems that the memory isn't clearing
"""
for i_episode in range(NUM_EPISODES):
    
    print(f"Episode Number: {i_episode}")
    
    # Reset the env at the beginning of the episode
    obs = env.reset()
    
    for idx in tqdm(range(NUM_ITERATIONS)):
        # Convert the state and get the action
        state = torch.from_numpy(obs)
        action = select_action(state)
        
        # Step the environment with the chosen action
        state_obs, reward, done, info = env.step(action.item())
        reward = torch.tensor([reward], device=device)
        
        # Check to see if the env is done or not
        if not done:
            next_state = torch.from_numpy(state_obs)
        else:
            next_state = None
        
        # Add this information to the buffer
        memory.push(state, action, next_state, reward)
        
        # Move onto the next state and optimise the model
        obs = state_obs
        optimise_model()
        
        if done:
            episode_durations.append(idx + 1)
            break
    # Sync the target network with the policy network every TARGET_UPDATE episodes
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
        
print("Finished training")             

Episode Number: 0


 14%|█▍        | 28/200 [00:00<00:00, 5011.33it/s]


Episode Number: 1


  6%|▋         | 13/200 [00:00<00:00, 2959.02it/s]


Episode Number: 2


 12%|█▏        | 24/200 [00:00<00:00, 5726.34it/s]


Episode Number: 3


  5%|▌         | 10/200 [00:00<00:00, 4599.02it/s]


Episode Number: 4


  8%|▊         | 16/200 [00:00<00:00, 5162.22it/s]

Finished training
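
The target network is meant to be synced every `TARGET_UPDATE` episodes. That kind of periodic check needs the modulo operator `%`; the bitwise `&` operator looks similar but fires on a different set of episodes. A quick comparison over the first 30 episode indices:

```python
TARGET_UPDATE = 10

# Episodes on which a sync would fire in the first 30 episodes
with_modulo = [i for i in range(30) if i % TARGET_UPDATE == 0]
with_bitwise_and = [i for i in range(30) if i & TARGET_UPDATE == 0]

print(with_modulo)       # [0, 10, 20]
print(with_bitwise_and)  # [0, 1, 4, 5, 16, 17, 20, 21]
```

With only five episodes, the two versions differ on which of episodes 0 to 4 trigger a sync: `%` fires on episode 0 only, while `&` fires on 0, 1 and 4.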





In [33]:
env.reset()

for idx in range(MAX_EPISODES):
    obs = env.reset()
    for t in range(MAX_ITERATIONS):
        env.render()
        state = torch.from_numpy(obs)
        # select_action is still epsilon-greedy, so some of these actions are random
        action = select_action(state)
        # Keep the new observation so the next action is based on the current state
        obs, reward, done, info = env.step(action.item())
        
        if done:
            print(obs)
            print(f"Episode finished after {t+1} timesteps")
            break
env.close()

[ 0.0768502   0.79675883 -0.21398349 -1.4539504 ]
Episode finished after 16 timesteps
[ 0.15789364  1.136877   -0.24816257 -2.0722616 ]
Episode finished after 16 timesteps
[ 0.1826065   1.3842472  -0.22382887 -2.2761858 ]
Episode finished after 17 timesteps
[ 0.2268194   1.7928833  -0.23885089 -2.757588  ]
Episode finished after 15 timesteps
[ 0.172525    1.5639651  -0.22253878 -2.5371559 ]
Episode finished after 10 timesteps
[ 0.17375931  1.1689367  -0.21678406 -2.0024483 ]
Episode finished after 12 timesteps
[ 0.20481385  1.7628695  -0.23886864 -2.7398944 ]
Episode finished after 13 timesteps
[ 0.12859298  1.560231   -0.21214361 -2.4135008 ]
Episode finished after 12 timesteps
[ 0.17984591  2.097453   -0.25302044 -3.2621856 ]
Episode finished after 13 timesteps
[ 0.15543552  1.3692449  -0.21164376 -2.229436  ]
Episode finished after 9 timesteps
[ 0.10136188  1.8122773  -0.24113558 -2.826864  ]
Episode finished after 9 timesteps
[ 0.13333464  0.9483672  -0.23620047 -1.8014073 ]
Episod

In [47]:
memory

<__main__.ReplayMemory at 0x7fadce6feac0>

In [49]:
transitions = memory.sample(5)
batch = Transition(*zip(*transitions))
non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                        batch.next_state)), device=device, dtype=torch.bool)
non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])

state_batch = torch.cat(batch.state)
action_batch = torch.cat(batch.action)
reward = torch.cat(batch.reward)
action_batch

tensor([[0],
        [1],
        [0],
        [0],
        [1]])

In [51]:
policy_net(state_batch).gather(1, action_batch)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x20 and 4x32)
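
The `1x20` in this error is the five sampled states, each of shape `(4,)`, joined end to end by `torch.cat` into one vector of 20 values. A sketch of the difference between `torch.cat` and `torch.stack` on random stand-in states (the single `nn.Linear` layer here is only a stand-in for the first layer of the DQN):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 32)                     # same input size as the DQN's first layer
states = [torch.randn(4) for _ in range(5)]  # five fake observations, each of shape (4,)

flat = torch.cat(states)     # joins along the existing dimension
print(flat.shape)            # torch.Size([20])

batch = torch.stack(states)  # adds a new leading batch dimension
print(batch.shape)           # torch.Size([5, 4])
print(layer(batch).shape)    # torch.Size([5, 32])

try:
    layer(flat)              # 20 features fed into a 4-input layer
except RuntimeError as err:
    print("RuntimeError:", err)
```

An alternative is to store each state with a leading batch dimension (for example via `unsqueeze(0)`), in which case `torch.cat` produces the `(batch, 4)` shape directly. This probably also explains why the five-episode run finished without errors: going by the progress bars, it collected only about 91 steps in total (28 + 13 + 24 + 10 + 16), fewer than `BATCH_SIZE = 128`, so `optimise_model()` returned early every time and never reached the batching code.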