<a href="https://colab.research.google.com/github/daystram/ml-playground/blob/master/01_gym_cartpole/01b_gym_cartpole_dnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01b Gym CartPole using DNN

*https://github.com/daystram/ml-playground*

---

In this example, we take a different approach to the CartPole problem. We first train our model on a generated dataset of observation-action pairs, then see how well the trained model performs. This resembles **supervised learning**: the model/agent does not learn from the reward *on-the-go*.

The model is a deep neural network built with Keras on the TensorFlow backend. The training dataset is generated beforehand by running random actions on the environment and collecting the observation-action pairs from runs whose reward reaches a given threshold. We keep only the runs that achieve a reasonably decent reward with random actions, and then see how the model can improve on them.



## 0. Initialization

We will clone the git repo to load the setup script and required modules. The repo should also be added to Python's module-path for easier module imports.

In [1]:
!rm -r ml-playground > /dev/null 2>&1
!git clone https://github.com/daystram/ml-playground.git
!sh ml-playground/helper/setup.sh > /dev/null 2>&1

Cloning into 'ml-playground'...
remote: Enumerating objects: 241, done.
remote: Counting objects: 100% (241/241), done.
remote: Compressing objects: 100% (181/181), done.
remote: Total 241 (delta 139), reused 115 (delta 50), pack-reused 0
Receiving objects: 100% (241/241), 509.23 KiB | 812.00 KiB/s, done.
Resolving deltas: 100% (139/139), done.


In [0]:
import warnings; warnings.filterwarnings('ignore')
import sys; sys.path.append('/content/ml-playground')

## 1. Training the Model

### a. Loading the Environment

Let's load the environment! At every step, the environment takes an action and returns an observation. The observation is a list of 4 elements, and the action is a single discrete value, either 0 or 1. We will use these shapes when we create our model.

In [3]:
import gym

env = gym.make("CartPole-v1")
print(env.observation_space)
print(env.action_space)

Box(4,)
Discrete(2)
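
For reference, the four observation elements in Gym's CartPole are the cart position, cart velocity, pole angle and pole angular velocity, and the two actions push the cart left (0) or right (1). A small sketch that labels a raw observation; the helper name `label_observation` and the sample values are made up for illustration:

```python
# Field order follows Gym's CartPole documentation.
OBSERVATION_FIELDS = ("cart_position", "cart_velocity",
                      "pole_angle", "pole_angular_velocity")
ACTION_NAMES = {0: "push_left", 1: "push_right"}

def label_observation(observation):
  """Pair each of the 4 observation values with its name (hypothetical helper)."""
  if len(observation) != len(OBSERVATION_FIELDS):
    raise ValueError("expected 4 values, got {}".format(len(observation)))
  return dict(zip(OBSERVATION_FIELDS, observation))

# Made-up values, only to show the structure
sample = [0.01, -0.02, 0.03, 0.04]
labeled = label_observation(sample)
print(labeled)
print(ACTION_NAMES[1])
```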


### b. Generating Training Dataset

To generate this dataset, we take random actions in the environment. The parameters are as follows:

- 250000 attempts, each producing one sequence of random actions on the environment;
- up to 1200 steps of simulation per attempt (set this too low and we would not have *explored* the action space widely enough for good results; note that CartPole-v1 itself ends an episode after 500 steps);
- a goal threshold of 120: only attempts that yield at least this reward contribute their observation-action pairs to the training data (set this too low and our data would hold a low standard for the targeted goal; set it too high and the sample size would get too small for good training).

We can see how many samples we collected and the distribution of the rewards that the accepted attempts achieved. This information can be used to tweak the parameters above to balance the quality and quantity of our dataset.

In [4]:
%%time
import numpy as np
from tqdm import tqdm_notebook

acc_fitness = []
training_data = []

sample_count = 250000
steps = 1200
goal_tres = 120

for _ in tqdm_notebook(range(sample_count), desc="Generating"):
  fitness = 0
  action_seq = []
  prev_observation = env.reset()
  
  for _ in range(steps):
    action = np.random.randint(0, 2)
    action_seq.append([prev_observation, action])
    prev_observation, reward, done, _ = env.step(action)
    fitness += reward
    if done: break
    
  if fitness >= goal_tres:
    acc_fitness.append(fitness)
    for seq in action_seq:
      if seq[1] == 1:
        action = [0, 1]
      else:
        action = [1, 0]
      training_data.append([seq[0], action])
env.close()
print("{} samples".format(len(training_data)))
print(np.array(acc_fitness))

2510 samples
[120. 121. 153. 123. 122. 120. 134. 143. 120. 128. 127. 125. 141. 148.
 122. 135. 123. 122. 183.]
CPU times: user 55.1 s, sys: 2.27 s, total: 57.4 s
Wall time: 54.2 s
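
The printed rewards already tell us a few things about this dataset. A short sketch using only the numbers from the output above (the helper name `summarize_rewards` is hypothetical): every accepted attempt contributes one training sample per step, so the rewards should add up to the sample count.

```python
import statistics

# Rewards of the accepted attempts, copied from the output above
accepted = [120, 121, 153, 123, 122, 120, 134, 143, 120, 128, 127, 125, 141, 148,
            122, 135, 123, 122, 183]
sample_count = 250000  # number of attempts used in the cell above

def summarize_rewards(rewards, attempts):
  """Basic statistics over the rewards of the accepted attempts."""
  return {
    "accepted_attempts": len(rewards),
    "total_samples": sum(rewards),  # one observation-action pair per step
    "min": min(rewards),
    "max": max(rewards),
    "mean": statistics.mean(rewards),
    "acceptance_rate": len(rewards) / attempts,
  }

summary = summarize_rewards(accepted, sample_count)
print(summary)
```

Only 19 of the 250000 random attempts pass the threshold here, and their rewards sum to the 2510 samples reported above.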


### c. Building the Model

As mentioned earlier, we use Keras for simple, quick prototyping of the model. I do not yet have much knowledge or experience in designing and tuning a neural network's topology, so I will go with a simple, generic approach.

Remember to adjust the input and output dimensions of the network to match the environment.
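
The dimensions come straight from the environment: the input layer takes the 4 observation values, and the output layer has one unit per action, matching the one-hot targets built while generating the dataset. A minimal sketch of that encoding in plain Python (the `one_hot` helper is hypothetical; the notebook does the same thing with an inline `if`/`else`):

```python
def one_hot(action, n_actions=2):
  """Encode a discrete action as a one-hot list, e.g. 1 -> [0, 1]."""
  if not 0 <= action < n_actions:
    raise ValueError("action {} outside 0..{}".format(action, n_actions - 1))
  encoded = [0] * n_actions
  encoded[action] = 1
  return encoded

# One made-up observation-action pair, shaped like an entry of training_data
observation = [0.0, 0.1, -0.02, 0.3]
pair = [observation, one_hot(1)]

input_dim = len(pair[0])   # width of the input layer
output_dim = len(pair[1])  # width of the output layer
print(input_dim, output_dim, pair[1])
```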

In [5]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

drop_rate = 0.1
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(4,)))
model.add(Dropout(drop_rate))
model.add(Dense(256, activation='relu'))
model.add(Dropout(drop_rate))
model.add(Dense(512, activation='relu'))
model.add(Dropout(drop_rate))
model.add(Dense(256, activation='relu'))
model.add(Dropout(drop_rate))
model.add(Dense(128, activation='relu'))
model.add(Dropout(drop_rate))
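# softmax turns the two outputs into action probabilities, which is what
# categorical_crossentropy expects (relu outputs are not a distribution)
model.add(Dense(2, activation='softmax'))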

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


### d. Training

Let's begin training our model!

Notice that the training step itself does not use the reward yielded by the environment (the reward only served to filter the dataset earlier). This approach can be useful for environments that do not have a clear or suitable reward system for the model to train on. This contrasts with [#01a](https://github.com/daystram/ml-playground/blob/master/01_gym_cartpole/01a_gym_cartpole_neuroevolution.ipynb), where we rely heavily on the given reward.

In [6]:
%%time
X = np.array([i[0] for i in training_data])
y = np.array([i[1] for i in training_data])

model.fit(X, y, epochs=20, batch_size=32)

Instructions for updating:
Use tf.cast instead.
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
CPU times: user 10.5 s, sys: 1.28 s, total: 11.8 s
Wall time: 9.62 s


### e. Evaluation

Our model does not seem to reach a high accuracy, as seen in the training output above, but we can see that the loss has been dropping. The training process minimizes the loss rather than maximizing the accuracy directly, which is why __choosing the correct error (loss) function is essential for a great training result__.
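
To see why loss and accuracy can move differently, consider the categorical cross-entropy of a single sample. A small worked sketch in plain Python with made-up probabilities: both predictions pick the right action, so they count the same towards accuracy, yet the more confident one has a much lower loss.

```python
import math

def categorical_crossentropy(target, predicted):
  """Cross-entropy between a one-hot target and predicted probabilities."""
  return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def is_correct(target, predicted):
  """True when the most probable action matches the target action."""
  return predicted.index(max(predicted)) == target.index(max(target))

target = [1, 0]           # the recorded action was 0
unsure = [0.55, 0.45]     # made-up prediction, barely favouring action 0
confident = [0.90, 0.10]  # made-up prediction, strongly favouring action 0

for name, p in (("unsure", unsure), ("confident", confident)):
  print(name, is_correct(target, p), round(categorical_crossentropy(target, p), 3))
```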

It also seems to perform well on the environment, even though the training dataset was far from perfect. **I think it's because the model learns from the dataset as one whole collection of observation-action pairs (instead of individual runs), so it learns the best patterns from it, resulting in a generally better fit to the data.**

In [9]:
from helper import nbdisplay as disp

!rm -r video
env = disp.wrap_env(gym.make("CartPole-v1"))
observation = env.reset()
fitness = 0
for _ in range(steps):
  action = np.argmax(model.predict(np.array([observation])))
  observation, reward, done, _ = env.step(action)
  fitness += reward
  if done: break

env.close()
disp.show_video()
print("Video can be viewed in Google Colab, use the link on top of the page to open")
print("------ Agent: Reward {}".format(fitness))  

Video can be viewed in Google Colab, use the link on top of the page to open
------ Agent: Reward 500.0
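
A single run can be lucky, so it may be worth averaging the reward over several episodes. A rough sketch under assumptions: `average_reward` and the stand-in `CountdownEnv` are hypothetical and only mimic the `reset()`/`step()` interface used in this notebook (where `step` returns four values). With the real setup, one would pass the Gym environment and a policy such as `lambda obs: np.argmax(model.predict(np.array([obs])))`.

```python
def average_reward(env, policy, episodes=10, max_steps=1200):
  """Run `policy` for several episodes and return the mean total reward."""
  totals = []
  for _ in range(episodes):
    observation = env.reset()
    total = 0.0
    for _step in range(max_steps):
      observation, reward, done, _info = env.step(policy(observation))
      total += reward
      if done:
        break
    totals.append(total)
  return sum(totals) / len(totals)

class CountdownEnv:
  """Toy stand-in for a Gym environment: every episode lasts `length` steps."""
  def __init__(self, length):
    self.length = length
    self.remaining = length

  def reset(self):
    self.remaining = self.length
    return [0.0, 0.0, 0.0, 0.0]

  def step(self, action):
    # Reward of 1 per step, like CartPole; done once the countdown runs out
    self.remaining -= 1
    return [0.0, 0.0, 0.0, 0.0], 1.0, self.remaining <= 0, {}

print(average_reward(CountdownEnv(5), policy=lambda obs: 0, episodes=3))
```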


Alright, save that model.

In [0]:
import pickle
import time

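# Timestamped filename so earlier saves are not overwritten
filename = '02-agent-{}.model'.format(time.strftime("%Y%m%d_%H%M%S"))
# The context manager closes the file even if pickling fails
with open(filename, 'wb') as file:
  pickle.dump(model, file)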
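
To reuse the agent later, the file can be read back with `pickle.load`. A minimal round-trip sketch: it uses a plain dictionary as a stand-in object and a temporary directory, since whether a Keras model pickles cleanly depends on the installed Keras version (an assumption worth checking; Keras also offers `model.save()` and `keras.models.load_model()`). The helper names `save_object` and `load_object` are hypothetical.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
  """Pickle `obj` to `path`."""
  with open(path, 'wb') as file:
    pickle.dump(obj, file)

def load_object(path):
  """Load a pickled object back from `path`."""
  with open(path, 'rb') as file:
    return pickle.load(file)

# Stand-in for the trained model, only to demonstrate the round trip
stand_in = {"name": "agent", "weights": [0.1, 0.2, 0.3]}
with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, "agent.model")
  save_object(stand_in, path)
  restored = load_object(path)

print(restored == stand_in)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.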