# Learning to use Gym, PyTorch and StableBaselines3 for Reinforcement learning
Very simple notebook for learning how to use the three tools mentioned above for reinforcement learning. I might even throw in Weights & Biases if I'm feeling lucky and then eventually move onto using MuJoCo for better physics simulation. There is a lot to do here so need to be ready for a lot of fighting
## To-Do List
- [ ] Fix the error where no `torch.tensor` is being pased
- [ ] Run the simulation and try and get a decent model
- [ ] Evaluate the model on a better set of variables
- [ ] Trasnfer to the GPU (somehow? Don't know if CUDA is installed)

In [1]:
import numpy as np
import sklearn as sk
import matplotlib as plt
from tqdm import tqdm
import gym
import matplotlib.pyplot as plt

MAX_EPISODES = 20
MAX_ITERATIONS = 100

gym.__version__


'0.21.0'

In [2]:
# Start with just the simple cartpole problem

env = gym.make('CartPole-v1')
env.reset()

for idx in range(MAX_EPISODES):
    observation = env.reset()
    for t in range(MAX_ITERATIONS):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        
        if done:
            print(observation)
            print(f"Epsiode finished after {t+1} timesteps")
            break
env.close()

[-0.03778828  0.21992627 -0.21858265 -0.99981505]
Epsiode finished after 35 timesteps
[-0.13366812 -0.41815755  0.2115568   1.1740344 ]
Epsiode finished after 30 timesteps
[-0.19203323 -0.5585572   0.21386503  1.1523181 ]
Epsiode finished after 13 timesteps
[-0.07295746  0.06602277 -0.22327465 -1.3058242 ]
Epsiode finished after 42 timesteps
[-0.07880972 -0.97056675  0.2420936   1.7659268 ]
Epsiode finished after 9 timesteps
[ 0.2104759  1.7434605 -0.2310046 -2.7885432]
Epsiode finished after 13 timesteps
[-0.12281333 -0.9711392   0.21947926  1.7758803 ]
Epsiode finished after 15 timesteps
[-0.05352812 -0.40221405  0.2128541   1.0717787 ]
Epsiode finished after 18 timesteps
[-0.13433659 -0.3704068   0.21527629  1.0174774 ]
Epsiode finished after 14 timesteps
[ 0.13332753  0.22601157 -0.228375   -0.8359939 ]
Epsiode finished after 25 timesteps
[-0.09054015 -0.7553634   0.2110363   1.6617942 ]
Epsiode finished after 16 timesteps
[ 0.10031619 -0.02151084 -0.21141703 -0.5745544 ]
Epsiode f

In [3]:
"""
Information regarding environment:

Observation space:
(4,) array with elements: [position, velocity, angle, angular velocity]

Action space:
(1,) array that is in the range {0,1} (DISCRETE)
"""

import math 
import random
from collections import namedtuple, deque
from itertools import count

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


device = torch.device("cpu")
print(torch.__version__)

1.11.0


  from .autonotebook import tqdm as notebook_tqdm


The idea is to convert the work in the "Hands-On ML" book to work in PyTorch. I'd rather work in PyTorch simply for my own work. For deep Q-learning, we get a function $Q^*:\textit{State}\times\textit{Action}\rightarrow \mathbb{R}$ which gives us the return for a specific action in a state. We want to maximise this: $\pi^*(s)=\underset{a}{\mbox{argmax }}Q^*(s, a)$

In [4]:
input_shape = 4 # Input shape is the observations of the cartpole [pos, vel, ang, ang_vel]
output_shape = 2 # Output shape is the action space size {-1,1}

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):
    """
    Replay buffer used to store the previous steps taken in the training algorithm. Uses the deque function.
    (Could change to using the Reverb library from DeepMind)
    """

    def __init__(self, capacity):
        self.memory = deque([],maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

In [37]:
# Build a DQN model in PyTorch. This is done using a class function to create a model parameters

class DQN(nn.Module):
    """
    Very basic network that has all linear layers with an input of 4 (the observations) and an output of 2 (the actions)
    """
    
    def __init__(self, inputs, output):
        super(DQN, self).__init__()
        self.l_input = nn.Linear(inputs, 32)
        self.hidden_1 = nn.Linear(32, 32)
        self.l_output = nn.Linear(32, output)
        
        
    def forward(self, x):
        x = self.l_input(x)
        x = F.relu(x)
        x = self.hidden_1(x)
        x = F.relu(x)
        return self.l_output(x)
    
net = DQN(input_shape, output_shape)
net

DQN(
  (l_input): Linear(in_features=4, out_features=32, bias=True)
  (hidden_1): Linear(in_features=32, out_features=32, bias=True)
  (l_output): Linear(in_features=32, out_features=2, bias=True)
)

In [6]:
BATCH_SIZE = 128
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10

policy_net = DQN(input_shape, output_shape).to(device)
target_net = DQN(input_shape, output_shape).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

DQN(
  (l_input): Linear(in_features=4, out_features=32, bias=True)
  (hidden_1): Linear(in_features=32, out_features=32, bias=True)
  (l_output): Linear(in_features=32, out_features=2, bias=True)
)

For this next part, we need to setup the optimiser for our model along with the function that determines taking a new step in the next direction. These functions will be adapted from the section in the "Hands-On ML" book, but using PyTorch for better future proofing

In [50]:
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) *  math.exp(-1*steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            return policy_net(state).max(0)[1].view(1,1)
    else:
        return torch.tensor([[random.randrange(output_shape)]], device=device, dtype=torch.long)
    
def optimise_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    batch = Transition(*zip(*transitions))
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])
    
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward = torch.cat(batch.reward)
    
    # Compute Q(s_t, a) for the model. 
    state_action_values = policy_net(state_batch).gather(1, action_batch)
    
    # Compute V(s_{t+1}) for all the next states
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
    
    # Compute the Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
    
    optimizer.zero_grad()
    loss.backward()
    
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1,1)
        
    # Step the optimiser
    optimizer.step()
    
    
episode_durations = []

Final step is the actual training loop. This is the part that causes problems, so need to be careful how we do this. The observation comes in the form of a `nparray` whereas we want it in the form of a `torch.tensor`. This conversion has to be done before the model is called as otherwise it doesn't work

In [56]:
NUM_EPISODES = 5
NUM_ITERATIONS = 200

"""
TODO:
Need to find out why it is not a torch tensor. Do I need to convert everything to tensors to stop the error?
UPDATE 25/05/2022@17:00 - It seems I do need to change everything to tensors to get the error to go away, but it seems that the memory isn't clearing
"""
for i_episode in range(NUM_EPISODES):
    
    # Reset the env at the beginning of the episode
    obs = env.reset()
    
    for idx in tqdm(range(NUM_ITERATIONS)):
        # Convert the state and get the action
        state = torch.tensor(obs)
        action = select_action(state)
        
        # Step the environment with the chosen action
        obs, reward, done, info = env.step(action.item())
        
        # Check to see if the env is done or not
        if not done:
            next_state = obs
        else:
            next_state = None
        
        # Add this information to the buffer
        memory.push(state, action, torch.tensor(next_state), torch.tensor(reward))
        
        # Move onto the next state and optimise the model
        obs = state
        optimise_model()
        
        if done:
            episode_durations.append(idx + 1)
            break;
    if i_episode & TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
        
print("Finished training")             

  0%|          | 0/200 [00:00<?, ?it/s]


TypeError: expected Tensor as element 0 in argument 0, but got numpy.ndarray