<a href="https://colab.research.google.com/github/Welwi/RL_typo/blob/master/RL_typo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, I train an RL agent to master two environments of differing complexity.

ENV 1: Cartpole
The goal is to balance a pole, protruding from a cart, in an upright position by only moving the base left or right. This is an env with a low-dimensional observation space.

ENV 2: Pong
The goal is to beat the opponent. The env has a high-dimensional observation space: the agent learns directly from raw pixels.

In [1]:

!apt-get install -y xvfb python-opengl x11-utils > /dev/null 2>&1
!pip install gym pyvirtualdisplay scikit-video > /dev/null 2>&1

!pip install mitdeeplearning



## Steps of an RL problem in general:
1. Initialize the env and the agent: describe the observations the agent can receive and the actions it can take in the env.

2. Define the agent's memory: this enables the agent to remember its past actions, observations and rewards.

3. Define a reward function: describes the reward associated with an action or sequence of actions.

4. Define the learning algorithm: this is used to reinforce the agent's good behaviors and discourage the bad ones.
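
The four steps above can be sketched end to end on a deliberately tiny problem. This is only an illustration, not the Cartpole setup used in this notebook: the two-action environment, the `preferences` table, the `memory` dictionary and the learning rate are all hypothetical choices made to keep the example short.

```python
import math
import random

# Step 1: a hypothetical two-action environment and a tabular softmax agent.
# Action 1 always pays a reward of 1, action 0 pays nothing.
N_ACTIONS = 2
preferences = [0.0, 0.0]   # the agent's learnable parameters
rng = random.Random(0)     # fixed seed so the run is repeatable


def action_probabilities(prefs):
    """Softmax over the action preferences."""
    highest = max(prefs)
    exps = [math.exp(p - highest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


# Step 3: the reward function of the toy environment.
def reward_for(action):
    return 1.0 if action == 1 else 0.0


# Step 2: the agent's memory of what it did and what it earned.
memory = {"actions": [], "rewards": []}

# Step 4: a simple policy-gradient-style update applied after every action.
LEARNING_RATE = 0.5
for _ in range(200):
    probs = action_probabilities(preferences)
    action = rng.choices(range(N_ACTIONS), weights=probs)[0]
    reward = reward_for(action)
    memory["actions"].append(action)
    memory["rewards"].append(reward)
    for a in range(N_ACTIONS):
        # Gradient of log-softmax: (1 - p) for the chosen action, -p otherwise.
        grad = (1.0 - probs[a]) if a == action else -probs[a]
        preferences[a] += LEARNING_RATE * reward * grad

final_probs = action_probabilities(preferences)
print("P(action 1) after training:", round(final_probs[1], 3))
```

Because only action 1 is ever rewarded, every update pushes probability toward it, so the agent ends up strongly preferring the rewarded action.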

In [2]:
import tensorflow as tf
import numpy as np

import base64, io, time, gym
import IPython, functools

import matplotlib.pyplot as plt

from tqdm import tqdm

import mitdeeplearning as mdl

## PART 1: CARTPOLE

Gym is a toolkit that provides several pre-defined environments for training and testing RL agents.

In Cartpole, the pole starts upright and the goal is to prevent it from falling. A reward of +1 is given for every timestep that the pole remains upright.
The episode ends when the pole is more than 15 degrees from the vertical or the cart moves more than 2.4 units from the center of the track.
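
The +1-per-timestep reward and the two limits described above can be sketched in plain Python. This is a hand-written illustration of the rule as stated in the text, not Gym's implementation; the function names, the `(cart_position, pole_angle_degrees)` tuple layout and the sample trajectory are assumptions.

```python
# Limits taken from the description above; Gym's own implementation may differ.
MAX_ANGLE_DEGREES = 15.0
MAX_CART_OFFSET = 2.4


def is_within_limits(cart_position, pole_angle_degrees):
    """True while the pole is upright enough and the cart is still on the track."""
    return (abs(pole_angle_degrees) <= MAX_ANGLE_DEGREES
            and abs(cart_position) <= MAX_CART_OFFSET)


def episode_return(trajectory):
    """Count +1 for every timestep before the pole or cart leaves the limits."""
    total = 0.0
    for cart_position, pole_angle_degrees in trajectory:
        if not is_within_limits(cart_position, pole_angle_degrees):
            break
        total += 1.0
    return total


# A made-up trajectory of (cart_position, pole_angle_degrees) pairs.
trajectory = [(0.0, 1.0), (0.1, 4.0), (0.3, 9.0), (0.6, 16.0), (0.9, 20.0)]
print(episode_return(trajectory))
```

The longer the agent keeps the pole inside the limits, the larger the return, which is why maximizing total reward is the same as balancing for as long as possible.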

In [3]:
# Instantiating the cartpole env

env = gym.make('CartPole-v0')
env.seed(1)

[1]

Observations that help define the env:
1. cart position
2. cart velocity
3. pole angle
4. pole rotation rate

Actions that the agent can take:
- The agent can move the cart either right or left.

Both the observation space and the action space are therefore low-dimensional.

In [4]:
# Checking the size of the space
n_observations = env.observation_space
print('Env has observation space =', n_observations)

# Checking the num of actions that the agent can take
n_actions = env.action_space.n
print('Num of possible actions that the agent can choose from =', n_actions)

Env has observation space = Box(4,)
Num of possible actions that the agent can choose from = 2


In [5]:
n_actions

2

### Defining the agent

In deep RL, a neural network defines the agent.
This network takes in an observation of the environment and outputs a score (logit) for each possible action; a softmax then turns these scores into the probability of taking each action.
Since this is a low-dimensional observation space, a simple feed-forward NN is enough.

In [6]:
# Defining the cartpole agent
def create_carpole_model():

  model = tf.keras.models.Sequential([
      # hidden layer with 32 units
      tf.keras.layers.Dense(units=32, activation='relu'),
      # output layer: one logit per action, no activation
      tf.keras.layers.Dense(units=n_actions, activation=None)
  ])

  return model

cartpole_model = create_carpole_model()
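
To make the shapes concrete, here is a NumPy-only sketch of what the two Dense layers compute for a batch of observations. The random weights are placeholders standing in for whatever Keras initializes or learns, and `forward` is a hypothetical helper, not part of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observation size, hidden units, number of actions.
n_inputs, n_hidden, n_outputs = 4, 32, 2

# Placeholder random weights; a trained Keras model would hold learned values.
w1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)


def forward(observation_batch):
    """Dense(32, relu) followed by Dense(2, no activation)."""
    hidden = np.maximum(0.0, observation_batch @ w1 + b1)  # ReLU
    return hidden @ w2 + b2                                # raw logits


# One made-up observation with a batch dimension, shape (1, 4).
batch = np.array([[0.03, 0.00, -0.03, -0.03]])
logits = forward(batch)
print(logits.shape)
```

The output has one row per observation and one column per action, which is why the observation needs a batch dimension before it is passed to the model.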

Defining a feed forward pass through the network (action function)

- takes observations as inputs
- does a forward pass through the model
- outputs the agent action


In [7]:
def choose_action(model, observation):

  # add the batch dimension to the observation
  observation = np.expand_dims(observation, axis=0)

  # pass the observation through the model to get one logit per action
  logits = model.predict(observation)

  # pass the logits through softmax to get action probabilities
  prob_weights = tf.nn.softmax(logits).numpy()

  # randomly sample an action according to those probabilities
  action = np.random.choice(n_actions, size=1, p=prob_weights.flatten())[0]

  return action

## Agent's Memory

Enables the agent to remember past actions, observations and rewards.

In RL:
- Training happens alongside the agent's acting in the env.
- Episode: a sequence of actions that ends in a terminal state (pole falling or cart crashing).
- The agent needs to remember all of the observations and actions of the episode so that those actions can be reinforced or punished.
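
A minimal sketch of such a memory is a set of parallel lists that grow by one entry per timestep and are cleared at the start of each episode. The class name `EpisodeMemory` and its method names are hypothetical, chosen only for this illustration.

```python
class EpisodeMemory:
    """Stores the observations, actions and rewards of a single episode."""

    def __init__(self):
        self.clear()

    def clear(self):
        """Forget everything, e.g. at the start of a new episode."""
        self.observations = []
        self.actions = []
        self.rewards = []

    def add(self, observation, action, reward):
        """Record one timestep."""
        self.observations.append(observation)
        self.actions.append(action)
        self.rewards.append(reward)

    def __len__(self):
        return len(self.actions)


# Usage with three made-up timesteps.
memory = EpisodeMemory()
memory.add([0.03, 0.00, -0.03, -0.03], 1, 1.0)
memory.add([0.03, 0.19, -0.03, -0.33], 0, 1.0)
memory.add([0.04, 0.00, -0.04, -0.05], 1, 1.0)
print(len(memory), sum(memory.rewards))
```

Keeping the three lists aligned by index means the reward at position `t` can later be matched with the observation and action that produced it.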

## Breaking it down

In [8]:
# resets the state of the env and returns an initial observation
observation = env.reset()

In [9]:
observation

array([ 0.03073904,  0.00145001, -0.03088818, -0.03131252])

In [10]:
observation.shape

(4,)

In [11]:
# adding batch dimension to the obs
observation = np.expand_dims(observation, axis=0)

In [12]:
observation.shape

(1, 4)

In [13]:
# passing obs through the model
logits = cartpole_model(observation)






In [14]:
logits

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-0.00845183,  0.02072986]], dtype=float32)>

In [15]:
# pass the logits through softmax to get action probabilities
prob_weights = tf.nn.softmax(logits).numpy()

In [16]:
prob_weights

array([[0.49270508, 0.5072949 ]], dtype=float32)

In [17]:
prob_weights.flatten()

array([0.49270508, 0.5072949 ], dtype=float32)

In [18]:
prob_weights.shape

(1, 2)

In [19]:
prob_weights.flatten().shape

(2,)

In [20]:
action = np.random.choice(n_actions, size=1, p=prob_weights.flatten())

In [21]:
action

array([1])

In [22]:
action[0]

1

In [23]:
action = np.random.choice(n_actions, size=1, p=prob_weights.flatten())[0]
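
The softmax and sampling steps walked through above can be checked without TensorFlow. The sketch below recomputes the probabilities from the logits printed earlier and draws many samples to show that the sampled actions follow those probabilities; the `softmax` helper, the seed and the sample count are assumptions made for this illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


# Logits copied from the cell output above, shape (1, 2).
logits = np.array([[-0.00845183, 0.02072986]])
probs = softmax(logits).flatten()

# Draw many actions and compare how often each one is picked.
rng = np.random.default_rng(0)
samples = rng.choice(len(probs), size=10_000, p=probs)
frequencies = np.bincount(samples, minlength=len(probs)) / samples.size

print("probabilities:", probs)
print("empirical frequencies:", frequencies)
```

Sampling instead of always taking the most probable action means that an untrained agent, whose probabilities are close to 50/50, still tries both actions, which is what lets it explore.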